[jira] [Comment Edited] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306688#comment-14306688 ] Manoj Kumar edited comment on SPARK-5016 at 2/5/15 8:09 AM: Hi, I would like to fix this (since I'm familiar to an extent with this part of the code) and maybe we could merge this before the sparse input issue. 1. As a heuristic, how large should k be? 2. By distribute, do you mean to store samples (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L140) as a collection using sc.parallelize, so that it can be operated on in parallel across k? What role does numFeatures have? Thanks. was (Author: mechcoder): Hi, I would like to fix this (since I'm familiar to an extent with this part of the code) and maybe we could merge this before the sparse input issue. 1. As a heuristic, how large should k be? 2. By distribute, do you mean to store samples (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L140) as a collection using sc.parallelize, so that it can be operated on in parallel across k. Thanks. GaussianMixtureEM should distribute matrix inverse for large numFeatures, k --- Key: SPARK-5016 URL: https://issues.apache.org/jira/browse/SPARK-5016 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley If numFeatures or k are large, GMM EM should distribute the matrix inverse computation for Gaussian initialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
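A minimal sketch of the distribution idea discussed above, assuming one task per mixture component; the method name and Breeze-based types are illustrative only and are not the actual GaussianMixture code:
{code}
import breeze.linalg.{det, inv, DenseMatrix => BDM}
import org.apache.spark.SparkContext

// Compute each component's covariance inverse (and determinant) in its own
// task instead of looping over the k components on the driver.
def distributeGaussianInit(
    sc: SparkContext,
    covariances: Array[BDM[Double]]): Array[(BDM[Double], Double)] = {
  sc.parallelize(covariances.zipWithIndex, covariances.length)
    .map { case (sigma, i) => (i, (inv(sigma), det(sigma))) }
    .collect()
    .sortBy(_._1)
    .map(_._2)
}
{code}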
[jira] [Commented] (SPARK-5604) Remove setCheckpointDir from LDA and tree Strategy
[ https://issues.apache.org/jira/browse/SPARK-5604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306810#comment-14306810 ] Apache Spark commented on SPARK-5604: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4390 Remove setCheckpointDir from LDA and tree Strategy -- Key: SPARK-5604 URL: https://issues.apache.org/jira/browse/SPARK-5604 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Continue the discussion from the LDA PR. checkpointDir is a global Spark configuration, which should not be altered by an ML algorithm. We could check whether checkpointDir is set if checkpointInterval is positive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
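A minimal sketch of the suggested check, assuming it lives next to the algorithm's other parameter validation; the method name is hypothetical:
{code}
import org.apache.spark.SparkContext

// Fail fast if checkpointing is requested but no global checkpoint directory
// has been configured on the SparkContext.
def validateCheckpointing(sc: SparkContext, checkpointInterval: Int): Unit = {
  if (checkpointInterval > 0) {
    require(sc.getCheckpointDir.isDefined,
      "checkpointInterval is positive but no checkpoint directory is set; " +
        "call SparkContext.setCheckpointDir first")
  }
}
{code}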
[jira] [Commented] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308596#comment-14308596 ] Shekhar Bansal commented on SPARK-5081: --- I faced the same problem; moving to lz4 compression did the trick for me. Try spark.io.compression.codec=lz4 Shuffle write increases --- Key: SPARK-5081 URL: https://issues.apache.org/jira/browse/SPARK-5081 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Kevin Jung The size of shuffle write shown in the spark web UI is much different when I execute the same spark job with the same input data in both spark 1.1 and spark 1.2. At sortBy stage, the size of shuffle write is 98.1MB in spark 1.1 but 146.9MB in spark 1.2. I set the spark.shuffle.manager option to hash because its default value has changed, but spark 1.2 still writes more shuffle output than spark 1.1. It can increase disk I/O overhead exponentially as the input file gets bigger and it causes the jobs to take more time to complete. In the case of about 100GB input, for example, the size of shuffle write is 39.7GB in spark 1.1 but 91.0GB in spark 1.2.
spark 1.1
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|9|saveAsTextFile| |1169.4KB| |
|12|combineByKey| |1265.4KB|1275.0KB|
|6|sortByKey| |1276.5KB| |
|8|mapPartitions| |91.0MB|1383.1KB|
|4|apply| |89.4MB| |
|5|sortBy|155.6MB| |98.1MB|
|3|sortBy|155.6MB| | |
|1|collect| |2.1MB| |
|2|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
spark 1.2
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|12|saveAsTextFile| |1170.2KB| |
|11|combineByKey| |1264.5KB|1275.0KB|
|8|sortByKey| |1273.6KB| |
|7|mapPartitions| |134.5MB|1383.1KB|
|5|zipWithIndex| |132.5MB| |
|4|sortBy|155.6MB| |146.9MB|
|3|sortBy|155.6MB| | |
|2|collect| |2.0MB| |
|1|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
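For reference, a hedged example of switching the codec as suggested above (valid short names in Spark 1.2 include lz4, lzf, and snappy); the application name is arbitrary:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Use lz4 instead of the default codec for shuffle and other internal data.
val conf = new SparkConf()
  .setAppName("shuffle-codec-test")
  .set("spark.io.compression.codec", "lz4")
val sc = new SparkContext(conf)
{code}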
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308598#comment-14308598 ] Andrew Or commented on SPARK-5388: -- [~tigerquoll] I still don't think we should use DELETE for kill for the following reason. In normal REST servers that host static resources, if you GET after a DELETE, you run into a 404. Here, our resources are by no means static, and if you GET after a DELETE you actually get a different status (that your driver is now KILLED instead of RUNNING). Because of these side-effects I think it is safest to use POST. [~vanzin] - The action field is actually required, especially since many of the responses look quite alike. We need to know how to deserialize the messages safely in case the response we get from the server is not the type that we expect it to be (e.g. ErrorResponse). - Yes, I could rename the protocolVersion field. - The issue with having non-String types is that you will need to deal with numeric and boolean values specially. For instance, if the user does not explicitly set the field there is no easy way to not include it in the JSON without doing some Option hack. I went down that route and opted instead for simpler code. - The unknown fields reporting is added in the PR but is missing in the spec. In the PR it is reported in its own explicit field. - Even in the existing interface you can use o.a.s.deploy.Client to kill an application, and the security guarantees there are the same. I agree that it is something we need to address at some point, but I prefer to keep that outside the scope of this patch. Provide a stable application submission gateway in standalone cluster mode -- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: stable-spark-submit-in-standalone-mode-2-4-15.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. For more detail, please see the most recent design doc attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308615#comment-14308615 ] Guoqiang Li commented on SPARK-5556: LightLDA's sampling complexity is O(1) per token. The paper: http://arxiv.org/abs/1412.1576 The code (work in progress): https://github.com/witgo/spark/tree/LightLDA Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5598) Model import/export for ALS
[ https://issues.apache.org/jira/browse/SPARK-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308692#comment-14308692 ] Sean Owen commented on SPARK-5598: -- [~mengxr] No, no other tool could usefully read such a PMML file. The only argument for it would be consistency: you probably need *some* file to hold some metadata about the model, so, you could just use PMML rather than also invent another format for that too. The actual data can't feasibly be serialized in PMML since it would be far too large as XML. I'm not suggesting that text-based serialization of the vectors should be used; I was pointing more to the PMML container idea. Yes, if this only concerns data that will only be written/read by Spark, and is not intended for export, there isn't any value at all in PMML. I thought this might be covering model export, meaning, for some kind of external consumption. In that case, there's no good answer, but at least reusing PMML for the container could have small value. Model import/export for ALS --- Key: SPARK-5598 URL: https://issues.apache.org/jira/browse/SPARK-5598 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xiangrui Meng Please see parent JIRA for details on model import/export plans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-926) spark_ec2 script when ssh/scp-ing should pipe UserKnownHostsFile to /dev/null
[ https://issues.apache.org/jira/browse/SPARK-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-926. - Resolution: Duplicate Going to make this one the duplicate since SPARK-5403 has an active PR. spark_ec2 script when ssh/scp-ing should pipe UserKnownHostsFile to /dev/null --- Key: SPARK-926 URL: https://issues.apache.org/jira/browse/SPARK-926 Project: Spark Issue Type: New Feature Components: EC2 Affects Versions: 0.8.0 Reporter: Shay Seng Priority: Trivial The known hosts file on the local machine accumulates all kinds of cruft after a few cluster launches. When SSHing or SCPing, please add -o UserKnownHostsFile=/dev/null. Also remove the -t option from SSH, and only add it in when necessary, to reduce chatter on the console. e.g.
{code}
import subprocess
import time

# Copy a file to a given host through scp, throwing an exception if scp fails
def scp(host, opts, local_file, dest_file):
  subprocess.check_call(
      "scp -q -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i %s '%s' '%s@%s:%s'" %
      (opts.identity_file, local_file, opts.user, host, dest_file), shell=True)

# Run a command on a host through ssh, retrying up to two times
# and then throwing an exception if ssh continues to fail.
def ssh(host, opts, command, sshopts=""):
  tries = 0
  while True:
    try:
      # removed -t option from ssh command, not sure why it is required all the time.
      return subprocess.check_call(
          "ssh %s -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i %s %s@%s '%s'" %
          (sshopts, opts.identity_file, opts.user, host, command), shell=True)
    except subprocess.CalledProcessError as e:
      if tries > 2:
        raise e
      print "Couldn't connect to host {0}, waiting 30 seconds".format(e)
      time.sleep(30)
      tries = tries + 1
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5647) Output metrics do not show up for older hadoop versions (< 2.5)
Kostas Sakellis created SPARK-5647: -- Summary: Output metrics do not show up for older hadoop versions (< 2.5) Key: SPARK-5647 URL: https://issues.apache.org/jira/browse/SPARK-5647 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Kostas Sakellis Need to add output metrics for hadoop < 2.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308584#comment-14308584 ] Apache Spark commented on SPARK-5563: - User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/4419 LDA with online variational inference - Key: SPARK-5563 URL: https://issues.apache.org/jira/browse/SPARK-5563 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) parameters can be inferred using online variational inference, as in Hoffman, Blei and Bach. “Online Learning for Latent Dirichlet Allocation.” NIPS, 2010. This algorithm should be very efficient and should be able to handle much larger datasets than batch algorithms for LDA. This algorithm will also be important for supporting Streaming versions of LDA. The implementation will ideally use the same API as the existing LDA but use a different underlying optimizer. This will require hooking in to the existing mllib.optimization frameworks. This will require some discussion about whether batch versions of online variational inference should be supported, as well as what variational approximation should be used now or in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4279) Implementing TinkerPop on top of GraphX
[ https://issues.apache.org/jira/browse/SPARK-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308595#comment-14308595 ] Jianshi Huang commented on SPARK-4279: -- Is anyone working on this? Implementing TinkerPop on top of GraphX --- Key: SPARK-4279 URL: https://issues.apache.org/jira/browse/SPARK-4279 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Brennon York Priority: Minor [TinkerPop|https://github.com/tinkerpop] is a great abstraction for graph databases and has been implemented across various graph database backends. Has anyone thought about integrating the TinkerPop framework with GraphX to enable GraphX as another backend? Not sure if this has been brought up or not, but would certainly volunteer to spearhead this effort if the community thinks it to be a good idea. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308603#comment-14308603 ] Pedro Rodriguez commented on SPARK-5556: Posting here as a status update. I will be working on and opening a pull request for adding a collapsed Gibbs sampling version which uses FastLDA for super linear scaling with number of topics. Below is the design document (same as from the original LDA JIRA issue), along with the repository/branch I am working on. https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing https://github.com/EntilZha/spark/tree/LDA-Refactor Tasks * Rebase from the merged implementation, refactor appropriately * Merge/implement the required inheritance/trait/abstract classes to support two implementations (EM and Gibbs) using only the entry points exposed in the EM version, plus an optional argument to select between EM/Gibbs. * Do performance tests comparable to those run for EM LDA. Some details for inheritance/trait/abstract: General idea would be to create an API which LDA implementations must satisfy using a trait/abstract class. All implementation details would be encapsulated within a state object satisfying the trait/abstract class. LDA would be responsible for creating an EM or Gibbs state object based on a user argument switch/flag. Linked below is a sample implementation based on an earlier version of the merged EM code (which needs to be updated to reflect the changes since then, but it should show the idea well enough): https://github.com/EntilZha/spark/blob/LDA-Refactor/mllib/src/main/scala/org/apache/spark/mllib/topicmodeling/LDA.scala#L216-L242 Timeline: I have been busier than expected, but rebase/refactoring should be done in the next few days, then I will open a PR to get feedback while running performance tests. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
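As an illustration of the trait/abstract-class idea described above, a minimal self-contained outline follows; every name is made up for the example, and the real design lives in the linked document and branch:
{code}
// Hide each inference algorithm behind a common state trait and let the LDA
// entry point pick the implementation from a flag.
trait LdaOptimizerState {
  def iterate(): LdaOptimizerState   // one EM or Gibbs iteration
  def logLikelihood: Double
}

// Dummy stand-ins so the sketch compiles; real implementations would hold the
// distributed topic-term counts (EM graph, Gibbs assignment tables, etc.).
class EmState(val logLikelihood: Double) extends LdaOptimizerState {
  override def iterate(): LdaOptimizerState = new EmState(logLikelihood + 1.0)
}
class GibbsState(val logLikelihood: Double) extends LdaOptimizerState {
  override def iterate(): LdaOptimizerState = new GibbsState(logLikelihood + 1.0)
}

object Lda {
  // The existing public entry points stay the same; an optional argument
  // selects the optimizer.
  def initialState(algorithm: String): LdaOptimizerState = algorithm match {
    case "em"    => new EmState(0.0)
    case "gibbs" => new GibbsState(0.0)
    case other   => throw new IllegalArgumentException(s"Unknown LDA algorithm: $other")
  }
}
{code}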
[jira] [Commented] (SPARK-4279) Implementing TinkerPop on top of GraphX
[ https://issues.apache.org/jira/browse/SPARK-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308651#comment-14308651 ] Sean Owen commented on SPARK-4279: -- This sounds like something that should live outside Spark, no? I suggest closing this. Implementing TinkerPop on top of GraphX --- Key: SPARK-4279 URL: https://issues.apache.org/jira/browse/SPARK-4279 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Brennon York Priority: Minor [TinkerPop|https://github.com/tinkerpop] is a great abstraction for graph databases and has been implemented across various graph database backends. Has anyone thought about integrating the TinkerPop framework with GraphX to enable GraphX as another backend? Not sure if this has been brought up or not, but would certainly volunteer to spearhead this effort if the community thinks it to be a good idea. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5625) Spark binaries do not include Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5625. -- Resolution: Not a Problem All of these distributions include an assembly JAR with the entire Spark codebase. None are supposed to contain individual artifacts. Spark binaries do not include Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5635) Allow users to run .scala files directly from spark-submit
[ https://issues.apache.org/jira/browse/SPARK-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308776#comment-14308776 ] Sean Owen commented on SPARK-5635: -- spark-shell uses spark-submit, and spark-shell is already the thing that can ingest source code. As you say, you can already run a .scala file this way. What is needed beyond this? Allow users to run .scala files directly from spark-submit -- Key: SPARK-5635 URL: https://issues.apache.org/jira/browse/SPARK-5635 Project: Spark Issue Type: New Feature Components: Spark Core, Spark Shell Reporter: Grant Henke Priority: Minor Similar to the Python functionality, allow users to submit .scala files. Currently the way I simulate this is to use spark-shell and run: `spark-shell -i myscript.scala` Note: the user needs to add an exit call to the bottom of the script. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
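A hedged example of the workaround described above; the script contents and file name are hypothetical, and `sc` is the SparkContext that spark-shell pre-defines:
{code}
// myscript.scala -- run with: spark-shell -i myscript.scala
// The explicit exit at the end keeps the REPL from staying open once the
// script has finished.
val counts = sc.textFile("README.md")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)

System.exit(0)
{code}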
[jira] [Created] (SPARK-5645) Track local bytes read for shuffles - update UI
Kostas Sakellis created SPARK-5645: -- Summary: Track local bytes read for shuffles - update UI Key: SPARK-5645 URL: https://issues.apache.org/jira/browse/SPARK-5645 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Kostas Sakellis Currently we do not track the local bytes read for a shuffle read. The UI only shows the remote bytes read. This is pretty confusing to the user because: 1) In local mode all shuffle reads are local 2) the shuffle bytes written from the previous stage might not add up if there are some bytes that are read locally on the shuffle read side 3) With https://github.com/apache/spark/pull/4067 we display the total number of records so that won't line up with only showing the remote bytes read. I propose we track the remote and local bytes read separately. In the UI show the total bytes read and in brackets show the remote bytes read for a shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
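A minimal sketch of the proposal, with field and method names assumed for illustration (the real shuffle-read metrics class is more involved):
{code}
// Track local and remote bytes separately and render "total (remote)" in the UI.
case class ShuffleReadStats(remoteBytesRead: Long, localBytesRead: Long) {
  def totalBytesRead: Long = remoteBytesRead + localBytesRead
}

def uiReadLabel(stats: ShuffleReadStats): String =
  s"${stats.totalBytesRead} B (${stats.remoteBytesRead} B remote)"
{code}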
[jira] [Comment Edited] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308599#comment-14308599 ] Andrew Or edited comment on SPARK-5388 at 2/6/15 5:35 AM: -- By the way for the more specific comments it would be good if you can leave them on the PR itself: https://github.com/apache/spark/pull/4216. The specs and the actual code will diverge after some review so the most up-to-date version will likely be there. was (Author: andrewor14): By the way for the more specific comments it would be good if you can leave them on the PR itself: https://github.com/apache/spark/pull/4216. The specs and the actual code will likely diverge after some review so the most up-to-date version will likely be there. Provide a stable application submission gateway in standalone cluster mode -- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: stable-spark-submit-in-standalone-mode-2-4-15.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. For more detail, please see the most recent design doc attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5643) Add a show method to print the content of a DataFrame in columnar format
Reynold Xin created SPARK-5643: -- Summary: Add a show method to print the content of a DataFrame in columnar format Key: SPARK-5643 URL: https://issues.apache.org/jira/browse/SPARK-5643 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
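A hedged usage sketch of the proposed method; the data file is the people.json example shipped with Spark, and `sqlContext` is assumed to be in scope (e.g. from spark-shell):
{code}
// Print the first rows of the DataFrame in a readable, space-padded columnar
// layout instead of the Row.toString output you get from collect().
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
df.show()
{code}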
[jira] [Created] (SPARK-5644) Delete tmp dir when sc is stopped
Weizhong created SPARK-5644: --- Summary: Delete tmp dir when sc is stopped Key: SPARK-5644 URL: https://issues.apache.org/jira/browse/SPARK-5644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Weizhong Priority: Minor When we run the driver as a long-running service, the service process creates a SparkContext, runs a job, and then stops the context. Because we only call sc.stop but never exit the service process, the tmp dirs created by HttpFileServer and SparkEnv are not deleted after the SparkContext is stopped. This leads to too many tmp dirs accumulating if we create many SparkContexts to run jobs in this service process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
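A sketch of the long-running-service pattern being described, with an assumed toy job, to make the life cycle concrete:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Each request creates a context, runs a job, and stops it, but the JVM never
// exits, so the per-context tmp dirs created by HttpFileServer/SparkEnv pile up.
def runJob(jobId: Int): Long = {
  val sc = new SparkContext(
    new SparkConf().setAppName(s"service-job-$jobId").setMaster("local[2]"))
  try {
    sc.parallelize(1 to 1000).map(_.toLong * 2).reduce(_ + _)
  } finally {
    sc.stop()   // the tmp dirs survive this call
  }
}
{code}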
[jira] [Resolved] (SPARK-5031) ml.LogisticRegression score column should be renamed probability
[ https://issues.apache.org/jira/browse/SPARK-5031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5031. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3637 [https://github.com/apache/spark/pull/3637] ml.LogisticRegression score column should be renamed probability Key: SPARK-5031 URL: https://issues.apache.org/jira/browse/SPARK-5031 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Fix For: 1.3.0 In the spark.ml package, LogisticRegression has an output column score which contains the estimated probability of label 1. Score is a very overloaded term, so probability would be better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4942) ML Transformers should allow output cols to be turned on,off
[ https://issues.apache.org/jira/browse/SPARK-4942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4942. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3637 [https://github.com/apache/spark/pull/3637] ML Transformers should allow output cols to be turned on,off Key: SPARK-4942 URL: https://issues.apache.org/jira/browse/SPARK-4942 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Fix For: 1.3.0 ML Transformers will eventually output multiple columns (e.g., predicted labels, predicted confidences, probabilities, etc.). These columns should be optional. Benefits: * more efficient (though Spark SQL may be able to optimize) * cleaner column namespace if people do not want all output columns Proposal: * If a column name parameter (e.g., predictionCol) is an empty string, then do not output that column. This will require updating validateAndTransformSchema() to ignore empty output column names in addition to updating transform(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
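A minimal sketch of the proposal above (method and column names assumed; not the merged patch), showing how an empty column name can switch an output column off:
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Only append an output column when its name parameter is non-empty.
def withOptionalColumns(
    dataset: DataFrame,
    predictionCol: String,
    probabilityCol: String): DataFrame = {
  var output = dataset
  if (predictionCol.nonEmpty) {
    output = output.withColumn(predictionCol, lit(0.0))    // placeholder value
  }
  if (probabilityCol.nonEmpty) {
    output = output.withColumn(probabilityCol, lit(0.5))   // placeholder value
  }
  output
}
{code}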
[jira] [Resolved] (SPARK-4789) Standardize ML Prediction APIs
[ https://issues.apache.org/jira/browse/SPARK-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4789. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3637 [https://github.com/apache/spark/pull/3637] Standardize ML Prediction APIs -- Key: SPARK-4789 URL: https://issues.apache.org/jira/browse/SPARK-4789 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Fix For: 1.3.0 Create a standard set of abstractions for prediction in spark.ml. This will follow the design doc specified in [SPARK-3702]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5616) Add examples for PySpark API
[ https://issues.apache.org/jira/browse/SPARK-5616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308555#comment-14308555 ] Apache Spark commented on SPARK-5616: - User 'lazyman500' has created a pull request for this issue: https://github.com/apache/spark/pull/4417 Add examples for PySpark API Key: SPARK-5616 URL: https://issues.apache.org/jira/browse/SPARK-5616 Project: Spark Issue Type: Improvement Components: PySpark Reporter: dongxu Labels: examples, pyspark, python Fix For: 1.3.0 There are fewer PySpark API examples than Spark Scala API examples. For example: 1. Broadcast: how to use the broadcast operation API. 2. Modules: how to import another Python file from within a zip file. Add more examples for newcomers who want to use PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308599#comment-14308599 ] Andrew Or commented on SPARK-5388: -- By the way for the more specific comments it would be good if you can leave them on the PR itself: https://github.com/apache/spark/pull/4216. The specs and the actual code will likely diverge after some review so the most up-to-date version will likely be there. Provide a stable application submission gateway in standalone cluster mode -- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: stable-spark-submit-in-standalone-mode-2-4-15.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. For more detail, please see the most recent design doc attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308608#comment-14308608 ] Kevin Jung commented on SPARK-5081: --- Sorry, I will make an effort to provide new code to reproduce this problem, because I don't have the old code anymore. Shuffle write increases --- Key: SPARK-5081 URL: https://issues.apache.org/jira/browse/SPARK-5081 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Kevin Jung The size of shuffle write shown in the spark web UI is much different when I execute the same spark job with the same input data in both spark 1.1 and spark 1.2. At sortBy stage, the size of shuffle write is 98.1MB in spark 1.1 but 146.9MB in spark 1.2. I set the spark.shuffle.manager option to hash because its default value has changed, but spark 1.2 still writes more shuffle output than spark 1.1. It can increase disk I/O overhead exponentially as the input file gets bigger and it causes the jobs to take more time to complete. In the case of about 100GB input, for example, the size of shuffle write is 39.7GB in spark 1.1 but 91.0GB in spark 1.2.
spark 1.1
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|9|saveAsTextFile| |1169.4KB| |
|12|combineByKey| |1265.4KB|1275.0KB|
|6|sortByKey| |1276.5KB| |
|8|mapPartitions| |91.0MB|1383.1KB|
|4|apply| |89.4MB| |
|5|sortBy|155.6MB| |98.1MB|
|3|sortBy|155.6MB| | |
|1|collect| |2.1MB| |
|2|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
spark 1.2
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|12|saveAsTextFile| |1170.2KB| |
|11|combineByKey| |1264.5KB|1275.0KB|
|8|sortByKey| |1273.6KB| |
|7|mapPartitions| |134.5MB|1383.1KB|
|5|zipWithIndex| |132.5MB| |
|4|sortBy|155.6MB| |146.9MB|
|3|sortBy|155.6MB| | |
|2|collect| |2.0MB| |
|1|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308635#comment-14308635 ] Patrick Wendell commented on SPARK-5388: I think it's reasonable to use DELETE per [~tigerquoll]'s suggestion. It's not a perfect match with DELETE semantics, but I think it's fine to use it if it's not too much work. I also think calling it maxProtocolVersion is a good idea if those are indeed the semantics. For security, yeah the killing is the same as it is in the current mode, which is that there is no security. One thing we could do if there is user demand is add a flag that globally disables killing, but let's see if users request this first. Provide a stable application submission gateway in standalone cluster mode -- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: stable-spark-submit-in-standalone-mode-2-4-15.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. For more detail, please see the most recent design doc attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Dai updated SPARK-5563: - Assignee: yuhao yang LDA with online variational inference - Key: SPARK-5563 URL: https://issues.apache.org/jira/browse/SPARK-5563 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: yuhao yang Latent Dirichlet Allocation (LDA) parameters can be inferred using online variational inference, as in Hoffman, Blei and Bach. “Online Learning for Latent Dirichlet Allocation.” NIPS, 2010. This algorithm should be very efficient and should be able to handle much larger datasets than batch algorithms for LDA. This algorithm will also be important for supporting Streaming versions of LDA. The implementation will ideally use the same API as the existing LDA but use a different underlying optimizer. This will require hooking in to the existing mllib.optimization frameworks. This will require some discussion about whether batch versions of online variational inference should be supported, as well as what variational approximation should be used now or in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5566) Tokenizer for mllib package
[ https://issues.apache.org/jira/browse/SPARK-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308733#comment-14308733 ] yuhao yang commented on SPARK-5566: --- I mean only the underlying implementation. Tokenizer for mllib package --- Key: SPARK-5566 URL: https://issues.apache.org/jira/browse/SPARK-5566 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley There exist tokenizer classes in the spark.ml.feature package and in the LDAExample in the spark.examples.mllib package. The Tokenizer in the LDAExample is more advanced and should be made into a full-fledged public class in spark.mllib.feature. The spark.ml.feature.Tokenizer class should become a wrapper around the new Tokenizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5644) Delete tmp dir when sc is stopped
[ https://issues.apache.org/jira/browse/SPARK-5644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308719#comment-14308719 ] Apache Spark commented on SPARK-5644: - User 'Sephiroth-Lin' has created a pull request for this issue: https://github.com/apache/spark/pull/4412 Delete tmp dir when sc is stopped -- Key: SPARK-5644 URL: https://issues.apache.org/jira/browse/SPARK-5644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Weizhong Priority: Minor When we run the driver as a long-running service, the service process creates a SparkContext, runs a job, and then stops the context. Because we only call sc.stop but never exit the service process, the tmp dirs created by HttpFileServer and SparkEnv are not deleted after the SparkContext is stopped. This leads to too many tmp dirs accumulating if we create many SparkContexts to run jobs in this service process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308620#comment-14308620 ] Kevin Jung commented on SPARK-5081: --- To test under the same conditions, I set this to snappy for all Spark versions, but the problem still occurs. As far as I know, lz4 needs more CPU time than snappy but has a better compression ratio. Shuffle write increases --- Key: SPARK-5081 URL: https://issues.apache.org/jira/browse/SPARK-5081 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Kevin Jung The size of shuffle write shown in the spark web UI is much different when I execute the same spark job with the same input data in both spark 1.1 and spark 1.2. At sortBy stage, the size of shuffle write is 98.1MB in spark 1.1 but 146.9MB in spark 1.2. I set the spark.shuffle.manager option to hash because its default value has changed, but spark 1.2 still writes more shuffle output than spark 1.1. It can increase disk I/O overhead exponentially as the input file gets bigger and it causes the jobs to take more time to complete. In the case of about 100GB input, for example, the size of shuffle write is 39.7GB in spark 1.1 but 91.0GB in spark 1.2.
spark 1.1
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|9|saveAsTextFile| |1169.4KB| |
|12|combineByKey| |1265.4KB|1275.0KB|
|6|sortByKey| |1276.5KB| |
|8|mapPartitions| |91.0MB|1383.1KB|
|4|apply| |89.4MB| |
|5|sortBy|155.6MB| |98.1MB|
|3|sortBy|155.6MB| | |
|1|collect| |2.1MB| |
|2|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
spark 1.2
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|12|saveAsTextFile| |1170.2KB| |
|11|combineByKey| |1264.5KB|1275.0KB|
|8|sortByKey| |1273.6KB| |
|7|mapPartitions| |134.5MB|1383.1KB|
|5|zipWithIndex| |132.5MB| |
|4|sortBy|155.6MB| |146.9MB|
|3|sortBy|155.6MB| | |
|2|collect| |2.0MB| |
|1|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308619#comment-14308619 ] Pedro Rodriguez commented on SPARK-5556: I will read that paper; it seems interesting. It is probably worth discussing at some point: what is the philosophy behind supporting different algorithms? It seems like there are a good number (at least 2 Gibbs, 1 EM right now). Along the same line of thought, perhaps it would be better to open two pull requests: one which refactors the current LDA to allow multiple algorithms, and a second for the Gibbs itself? Thoughts? Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
[ https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308644#comment-14308644 ] Florian Verhein commented on SPARK-3185: [~dvohra] Sure, but the exception is thrown by tachyon... so you're not going to be able to fix it by changing the spark build SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER --- Key: SPARK-3185 URL: https://issues.apache.org/jira/browse/SPARK-3185 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.2 Environment: Amazon Linux AMI [ec2-user@ip-172-30-1-145 ~]$ uname -a Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/ The build I used (and MD5 verified): [ec2-user@ip-172-30-1-145 ~]$ wget http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz Reporter: Jeremy Chambers {code} org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 {code} When I launch SPARK 1.0.2 on Hadoop 2 in a new EC2 cluster, the above tachyon exception is thrown when Formatting JOURNAL_FOLDER. No exception occurs when I launch on Hadoop 1. Launch used: {code} ./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch sparkProd {code} {code} log snippet Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/ Exception in thread main java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246) at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73) at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53) at tachyon.UnderFileSystem.get(UnderFileSystem.java:53) at tachyon.Format.main(Format.java:54) Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at org.apache.hadoop.ipc.Client.call(Client.java:1070) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187) at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69) ... 
3 more Killed 0 processes Killed 0 processes ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes ---end snippet--- {code} *I don't have this problem when I launch without the --hadoop-major-version=2 (which defaults to Hadoop 1.x).* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4808) Spark fails to spill with small number of large objects
[ https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308677#comment-14308677 ] Apache Spark commented on SPARK-4808: - User 'mingyukim' has created a pull request for this issue: https://github.com/apache/spark/pull/4420 Spark fails to spill with small number of large objects --- Key: SPARK-4808 URL: https://issues.apache.org/jira/browse/SPARK-4808 Project: Spark Issue Type: Bug Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1 Reporter: Dennis Lawler Spillable's maybeSpill does not allow spill to occur until at least 1000 elements have been spilled, and then will only evaluate spill every 32nd element thereafter. When there is a small number of very large items being tracked, out-of-memory conditions may occur. I suspect that this and the every-32nd-element behavior was to reduce the impact of the estimateSize() call. This method was extracted into SizeTracker, which implements its own exponential backup for size estimation, so now we are only avoiding using the resulting estimated size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
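A paraphrased sketch (not the exact Spark source) of the maybeSpill behavior the report describes, to make the failure mode concrete; the threshold value is illustrative:
{code}
// elementsRead counts records inserted since the last spill. The size check
// only runs every 32 records and only after the first 1000, so a handful of
// very large records can blow past the memory threshold before a spill is
// ever considered.
var elementsRead = 0L
val myMemoryThreshold = 5L * 1024 * 1024   // illustrative initial threshold

def maybeSpill(currentMemory: Long): Boolean = {
  elementsRead += 1
  elementsRead > 1000 && elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold
}
{code}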
[jira] [Resolved] (SPARK-5639) Support DataFrame.renameColumn
[ https://issues.apache.org/jira/browse/SPARK-5639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5639. Resolution: Fixed Fix Version/s: 1.3.0 Support DataFrame.renameColumn -- Key: SPARK-5639 URL: https://issues.apache.org/jira/browse/SPARK-5639 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.3.0 It is incredibly hard to rename a column using the existing DSL. Let's provide that out of the box. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5391) SparkSQL fails to create tables with custom JSON SerDe
[ https://issues.apache.org/jira/browse/SPARK-5391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308627#comment-14308627 ] Muthupandi K edited comment on SPARK-5391 at 2/6/15 5:13 AM: - Same error occurred when a table is created with json serde in hive table and queried from SparkQL. was (Author: muthu): Same error occoured when a table is created with json serde in hive table and queried from SparkQL. SparkSQL fails to create tables with custom JSON SerDe -- Key: SPARK-5391 URL: https://issues.apache.org/jira/browse/SPARK-5391 Project: Spark Issue Type: Bug Components: SQL Reporter: David Ross - Using Spark built from trunk on this commit: https://github.com/apache/spark/commit/bc20a52b34e826895d0dcc1d783c021ebd456ebd - Build for Hive13 - Using this JSON serde: https://github.com/rcongiu/Hive-JSON-Serde First download jar locally: {code} $ curl http://www.congiu.net/hive-json-serde/1.3/cdh5/json-serde-1.3-jar-with-dependencies.jar /tmp/json-serde-1.3-jar-with-dependencies.jar {code} Then add it in SparkSQL session: {code} add jar /tmp/json-serde-1.3-jar-with-dependencies.jar {code} Finally create table: {code} create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'; {code} Logs for add jar: {code} 15/01/23 23:48:33 INFO thriftserver.SparkExecuteStatementOperation: Running query 'add jar /tmp/json-serde-1.3-jar-with-dependencies.jar' 15/01/23 23:48:34 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr. 15/01/23 23:48:34 INFO SessionState: Added /tmp/json-serde-1.3-jar-with-dependencies.jar to class path 15/01/23 23:48:34 INFO SessionState: Added resource: /tmp/json-serde-1.3-jar-with-dependencies.jar 15/01/23 23:48:34 INFO spark.SparkContext: Added JAR /tmp/json-serde-1.3-jar-with-dependencies.jar at http://192.168.99.9:51312/jars/json-serde-1.3-jar-with-dependencies.jar with timestamp 1422056914776 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result Schema: List() 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result Schema: List() {code} Logs (with error) for create table: {code} 15/01/23 23:49:00 INFO thriftserver.SparkExecuteStatementOperation: Running query 'create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'' 15/01/23 23:49:00 INFO parse.ParseDriver: Parsing command: create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed 15/01/23 23:49:01 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr. 
15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO parse.ParseDriver: Parsing command: create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed 15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=parse start=1422056941103 end=1422056941104 duration=1 from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Starting Semantic Analysis 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Creating table test_json position=13 15/01/23 23:49:01 INFO ql.Driver: Semantic Analysis Completed 15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=semanticAnalyze start=1422056941104 end=1422056941240 duration=136 from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null) 15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=compile start=1422056941071 end=1422056941252 duration=181 from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=Driver.execute from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO ql.Driver: Starting command: create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=TimeToSubmit start=1422056941067 end=1422056941258 duration=191
[jira] [Commented] (SPARK-5391) SparkSQL fails to create tables with custom JSON SerDe
[ https://issues.apache.org/jira/browse/SPARK-5391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308627#comment-14308627 ] Muthupandi K commented on SPARK-5391: - Same error occoured when a table is created with json serde in hive table and queried from SparkQL. SparkSQL fails to create tables with custom JSON SerDe -- Key: SPARK-5391 URL: https://issues.apache.org/jira/browse/SPARK-5391 Project: Spark Issue Type: Bug Components: SQL Reporter: David Ross - Using Spark built from trunk on this commit: https://github.com/apache/spark/commit/bc20a52b34e826895d0dcc1d783c021ebd456ebd - Build for Hive13 - Using this JSON serde: https://github.com/rcongiu/Hive-JSON-Serde First download jar locally: {code} $ curl http://www.congiu.net/hive-json-serde/1.3/cdh5/json-serde-1.3-jar-with-dependencies.jar /tmp/json-serde-1.3-jar-with-dependencies.jar {code} Then add it in SparkSQL session: {code} add jar /tmp/json-serde-1.3-jar-with-dependencies.jar {code} Finally create table: {code} create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'; {code} Logs for add jar: {code} 15/01/23 23:48:33 INFO thriftserver.SparkExecuteStatementOperation: Running query 'add jar /tmp/json-serde-1.3-jar-with-dependencies.jar' 15/01/23 23:48:34 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr. 15/01/23 23:48:34 INFO SessionState: Added /tmp/json-serde-1.3-jar-with-dependencies.jar to class path 15/01/23 23:48:34 INFO SessionState: Added resource: /tmp/json-serde-1.3-jar-with-dependencies.jar 15/01/23 23:48:34 INFO spark.SparkContext: Added JAR /tmp/json-serde-1.3-jar-with-dependencies.jar at http://192.168.99.9:51312/jars/json-serde-1.3-jar-with-dependencies.jar with timestamp 1422056914776 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result Schema: List() 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result Schema: List() {code} Logs (with error) for create table: {code} 15/01/23 23:49:00 INFO thriftserver.SparkExecuteStatementOperation: Running query 'create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'' 15/01/23 23:49:00 INFO parse.ParseDriver: Parsing command: create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed 15/01/23 23:49:01 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr. 
15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO parse.ParseDriver: Parsing command: create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed 15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=parse start=1422056941103 end=1422056941104 duration=1 from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Starting Semantic Analysis 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Creating table test_json position=13 15/01/23 23:49:01 INFO ql.Driver: Semantic Analysis Completed 15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=semanticAnalyze start=1422056941104 end=1422056941240 duration=136 from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null) 15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=compile start=1422056941071 end=1422056941252 duration=181 from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=Driver.execute from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO ql.Driver: Starting command: create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=TimeToSubmit start=1422056941067 end=1422056941258 duration=191 from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=runTasks from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG
[jira] [Resolved] (SPARK-5631) Server IPC version 7 cannot communicate with client version 4
[ https://issues.apache.org/jira/browse/SPARK-5631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5631. -- Resolution: Not a Problem The right place to ask questions and discuss this is the mailing list. This means you have mismatched Hadoop versions, either between your Spark and Hadoop deployment, or because you included Hadoop code in your app. Server IPC version 7 cannot communicate with client version 4 -- Key: SPARK-5631 URL: https://issues.apache.org/jira/browse/SPARK-5631 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: Scala 2.10.4 Spark 1.2 CDH4.2 Reporter: DeepakVohra A Spark application generates the error Server IPC version 7 cannot communicate with client version 4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
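For context: the error "Server IPC version 7 cannot communicate with client version 4" indicates Hadoop 1.x client classes talking to a Hadoop 2.x (CDH4) cluster. A minimal sketch of the usual fix for an sbt build follows; the Cloudera repository URL and the exact CDH4.2 artifact version are assumptions to verify against the actual cluster:
{code}
// build.sbt (sketch): make the Hadoop client on the classpath match the cluster,
// and depend on a Spark build for the same Hadoop version instead of bundling a
// mismatched hadoop-client in the application jar.
resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "1.2.0" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.0.0-mr1-cdh4.2.0"
)
{code}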
[jira] [Resolved] (SPARK-5531) Spark download .tgz file does not get unpacked
[ https://issues.apache.org/jira/browse/SPARK-5531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5531. -- Resolution: Not a Problem Spark download .tgz file does not get unpacked -- Key: SPARK-5531 URL: https://issues.apache.org/jira/browse/SPARK-5531 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: Linux Reporter: DeepakVohra The spark-1.2.0-bin-cdh4.tgz file downloaded from http://spark.apache.org/downloads.html does not get unpacked. tar xvf spark-1.2.0-bin-cdh4.tgz gzip: stdin: not in gzip format tar: Child returned status 1 tar: Error is not recoverable: exiting now -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5645) Track local bytes read for shuffles - update UI
[ https://issues.apache.org/jira/browse/SPARK-5645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-5645: -- Assignee: Kostas Sakellis Track local bytes read for shuffles - update UI --- Key: SPARK-5645 URL: https://issues.apache.org/jira/browse/SPARK-5645 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Kostas Sakellis Assignee: Kostas Sakellis Currently we do not track the local bytes read for a shuffle read. The UI only shows the remote bytes read. This is pretty confusing to the user because: 1) In local mode all shuffle reads are local 2) the shuffle bytes written from the previous stage might not add up if there are some bytes that are read locally on the shuffle read side 3) With https://github.com/apache/spark/pull/4067 we display the total number of records so that won't line up with only showing the remote bytes read. I propose we track the remote and local bytes read separately. In the UI show the total bytes read and in brackets show the remote bytes read for a shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5643) Add a show method to print the content of a DataFrame in columnar format
[ https://issues.apache.org/jira/browse/SPARK-5643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308543#comment-14308543 ] Apache Spark commented on SPARK-5643: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4416 Add a show method to print the content of a DataFrame in columnar format Key: SPARK-5643 URL: https://issues.apache.org/jira/browse/SPARK-5643 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
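As a rough usage sketch of what such a method could look like (the method name comes from the issue title; the output format shown is assumed, and the linked pull request defines the real behaviour):
{code}
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
df.show()
// Prints something along the lines of:
// age  name
// null Michael
// 30   Andy
// 19   Justin
{code}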
[jira] [Created] (SPARK-5646) Record output metrics for cache
Kostas Sakellis created SPARK-5646: -- Summary: Record output metrics for cache Key: SPARK-5646 URL: https://issues.apache.org/jira/browse/SPARK-5646 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Kostas Sakellis We currently show the input metrics when coming from the cache but we don't track/show the output metrics when we write to the cache -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5646) Record output metrics for cache
[ https://issues.apache.org/jira/browse/SPARK-5646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-5646: -- Assignee: Kostas Sakellis Record output metrics for cache --- Key: SPARK-5646 URL: https://issues.apache.org/jira/browse/SPARK-5646 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Kostas Sakellis Assignee: Kostas Sakellis We currently show the input metrics when coming from the cache but we don't track/show the output metrics when we write to the cache -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307196#comment-14307196 ] Travis Galoppo commented on SPARK-5021: --- [~MechCoder] It is probably better to get something working, submit a PR (perhaps mark it [WIP]) and work out the kinks in the review process. GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
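To make "linear in the number of non-zero values" concrete, here is a sketch of the kind of per-sample update that avoids densifying the input; the helper below is hypothetical and not part of MLlib:
{code}
import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

// Accumulate a running sum of feature values by touching only the non-zero
// entries, so the cost is O(nnz) rather than O(numFeatures).
def addInPlace(sum: Array[Double], v: SparseVector): Unit = {
  var i = 0
  while (i < v.indices.length) {
    sum(v.indices(i)) += v.values(i)
    i += 1
  }
}

val v = Vectors.sparse(1000000, Array(3, 17), Array(2.0, 4.0)).asInstanceOf[SparseVector]
val sum = new Array[Double](1000000)
addInPlace(sum, v) // only two entries are visited
{code}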
[jira] [Commented] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307406#comment-14307406 ] Twinkle Sachdeva commented on SPARK-4705: - Hi [~vanzin], Regarding adding that for other modes, I just need to override an API after figuring out how to get the attempt id. I will plan for that. Thanks for the HTML stuff; I will upload the UI snapshot too. Driver retries in yarn-cluster mode always fail if event logging is enabled --- Key: SPARK-4705 URL: https://issues.apache.org/jira/browse/SPARK-4705 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Marcelo Vanzin yarn-cluster mode will retry to run the driver in certain failure modes. If event logging is enabled, this will most probably fail, because: {noformat} Exception in thread "Driver" java.io.IOException: Log directory hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003 already exists! at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129) at org.apache.spark.util.FileLogger.start(FileLogger.scala:115) at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74) at org.apache.spark.SparkContext.<init>(SparkContext.scala:353) {noformat} The event log path should be more unique. Or perhaps retries of the same app should clean up the old logs first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4897) Python 3 support
[ https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307463#comment-14307463 ] Josh Rosen commented on SPARK-4897: --- Hi [~ianozsvald], Until now, the main motivation for Python 2.6 support was that it's the default system Python on a few Linux distributions. So far, I think the overhead of supporting 2.6 has been fairly minimal, mostly involving a handful of small changes such as not treating certain objects as context managers (e.g. ZipFile objects). Let's try porting to 2.7 / 3.4 and then re-assess how hard Python 2.6 support will be. If it's really easy (a couple hours of work, max) then I don't see a reason to drop it, but if we have to go to increasingly convoluted lengths to keep it then it's probably not worth it if we're gaining 3.4 support in return. I think the main blocker to Python 3.4 support is the fact that nobody has really had time to work on it. I'd be happy to work with anyone who is interested in taking this on. Python 3 support Key: SPARK-4897 URL: https://issues.apache.org/jira/browse/SPARK-4897 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Josh Rosen Priority: Minor It would be nice to have Python 3 support in PySpark, provided that we can do it in a way that maintains backwards-compatibility with Python 2.6. I started looking into porting this; my WIP work can be found at https://github.com/JoshRosen/spark/compare/python3 I was able to use the [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] tool to handle the basic conversion of things like {{print}} statements, etc. and had to manually fix up a few imports for packages that moved / were renamed, but the major blocker that I hit was {{cloudpickle}}: {code} [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin Type "help", "copyright", "credits" or "license" for more information. Traceback (most recent call last): File "/Users/joshrosen/Documents/Spark/python/pyspark/shell.py", line 28, in <module> import pyspark File "/Users/joshrosen/Documents/spark/python/pyspark/__init__.py", line 41, in <module> from pyspark.context import SparkContext File "/Users/joshrosen/Documents/spark/python/pyspark/context.py", line 26, in <module> from pyspark import accumulators File "/Users/joshrosen/Documents/spark/python/pyspark/accumulators.py", line 97, in <module> from pyspark.cloudpickle import CloudPickler File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 120, in <module> class CloudPickler(pickle.Pickler): File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 122, in CloudPickler dispatch = pickle.Pickler.dispatch.copy() AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch' {code} This code looks like it will be difficult to port to Python 3, so this might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] for Python serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4897) Python 3 support
[ https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4897: -- Target Version/s: 1.4.0 (was: 1.3.0) Python 3 support Key: SPARK-4897 URL: https://issues.apache.org/jira/browse/SPARK-4897 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Josh Rosen Priority: Minor It would be nice to have Python 3 support in PySpark, provided that we can do it in a way that maintains backwards-compatibility with Python 2.6. I started looking into porting this; my WIP work can be found at https://github.com/JoshRosen/spark/compare/python3 I was able to use the [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] tool to handle the basic conversion of things like {{print}} statements, etc. and had to manually fix up a few imports for packages that moved / were renamed, but the major blocker that I hit was {{cloudpickle}}: {code} [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin Type help, copyright, credits or license for more information. Traceback (most recent call last): File /Users/joshrosen/Documents/Spark/python/pyspark/shell.py, line 28, in module import pyspark File /Users/joshrosen/Documents/spark/python/pyspark/__init__.py, line 41, in module from pyspark.context import SparkContext File /Users/joshrosen/Documents/spark/python/pyspark/context.py, line 26, in module from pyspark import accumulators File /Users/joshrosen/Documents/spark/python/pyspark/accumulators.py, line 97, in module from pyspark.cloudpickle import CloudPickler File /Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py, line 120, in module class CloudPickler(pickle.Pickler): File /Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py, line 122, in CloudPickler dispatch = pickle.Pickler.dispatch.copy() AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch' {code} This code looks like it will be hard difficult to port to Python 3, so this might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] for Python serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4897) Python 3 support
[ https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307474#comment-14307474 ] thom neale commented on SPARK-4897: --- I'm still very interested in helping with the 3.4 port, have only been prohibited by lack of free time. I'll ask if work will give me a half day to work on it. Python 3 support Key: SPARK-4897 URL: https://issues.apache.org/jira/browse/SPARK-4897 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Josh Rosen Priority: Minor It would be nice to have Python 3 support in PySpark, provided that we can do it in a way that maintains backwards-compatibility with Python 2.6. I started looking into porting this; my WIP work can be found at https://github.com/JoshRosen/spark/compare/python3 I was able to use the [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] tool to handle the basic conversion of things like {{print}} statements, etc. and had to manually fix up a few imports for packages that moved / were renamed, but the major blocker that I hit was {{cloudpickle}}: {code} [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin Type help, copyright, credits or license for more information. Traceback (most recent call last): File /Users/joshrosen/Documents/Spark/python/pyspark/shell.py, line 28, in module import pyspark File /Users/joshrosen/Documents/spark/python/pyspark/__init__.py, line 41, in module from pyspark.context import SparkContext File /Users/joshrosen/Documents/spark/python/pyspark/context.py, line 26, in module from pyspark import accumulators File /Users/joshrosen/Documents/spark/python/pyspark/accumulators.py, line 97, in module from pyspark.cloudpickle import CloudPickler File /Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py, line 120, in module class CloudPickler(pickle.Pickler): File /Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py, line 122, in CloudPickler dispatch = pickle.Pickler.dispatch.copy() AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch' {code} This code looks like it will be hard difficult to port to Python 3, so this might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] for Python serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5616) Add examples for PySpark API
dongxu created SPARK-5616: - Summary: Add examples for PySpark API Key: SPARK-5616 URL: https://issues.apache.org/jira/browse/SPARK-5616 Project: Spark Issue Type: Improvement Components: PySpark Reporter: dongxu Fix For: 1.3.0 There are fewer PySpark API examples than Spark Scala API examples. For example: 1. Broadcast: how to use the broadcast operation API. 2. Module: how to import another Python file packaged in a zip file. Add more examples for newcomers who want to use PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5617) test failure of SQLQuerySuite
wangfei created SPARK-5617: -- Summary: test failure of SQLQuerySuite Key: SPARK-5617 URL: https://issues.apache.org/jira/browse/SPARK-5617 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: wangfei SQLQuerySuite test failure: [info] - simple select (22 milliseconds) [info] - sorting (722 milliseconds) [info] - external sorting (728 milliseconds) [info] - limit (95 milliseconds) [info] - date row *** FAILED *** (35 milliseconds) [info] Results do not match for query: [info] 'Limit 1 [info]'Project [CAST(2015-01-28, DateType) AS c0#3630] [info] 'UnresolvedRelation [testData], None [info] [info] == Analyzed Plan == [info] Limit 1 [info]Project [CAST(2015-01-28, DateType) AS c0#3630] [info] LogicalRDD [key#0,value#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35 [info] [info] == Physical Plan == [info] Limit 1 [info]Project [16463 AS c0#3630] [info] PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35 [info] [info] == Results == [info] !== Correct Answer - 1 == == Spark Answer - 1 == [info] ![2015-01-28] [2015-01-27] (QueryTest.scala:77) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) [info] at org.scalatest.Assertions$class.fail(Assertions.scala:1328) [info] at org.scalatest.FunSuite.fail(FunSuite.scala:1555) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:77) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:95) [info] at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply$mcV$sp(SQLQuerySuite.scala:300) [info] at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply(SQLQuerySuite.scala:300) [info] at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply(SQLQuerySuite.scala:300) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5618) Optimise utility code.
Makoto Fukuhara created SPARK-5618: -- Summary: Optimise utility code. Key: SPARK-5618 URL: https://issues.apache.org/jira/browse/SPARK-5618 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Makoto Fukuhara Priority: Minor I refactored the evaluation timing and removed an unnecessary Regex API call, because the Regex API is relatively heavy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
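The Regex concern generalizes; a small illustration (not the actual patch) of moving pattern compilation out of the hot path and deferring it until first use:
{code}
object RegexExample {
  // Wasteful: builds a new Regex object on every call.
  def isSparkClassSlow(name: String): Boolean =
    "^org\\.apache\\.spark\\..*".r.findFirstIn(name).isDefined

  // Better: the Regex is compiled once, and only if it is ever needed.
  private lazy val SparkClassRegex = "^org\\.apache\\.spark\\..*".r
  def isSparkClass(name: String): Boolean =
    SparkClassRegex.findFirstIn(name).isDefined
}
{code}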
[jira] [Created] (SPARK-5631) Server IPC version 7 cannot communicate with client version 4
DeepakVohra created SPARK-5631: -- Summary: Server IPC version 7 cannot communicate with client version 4 Key: SPARK-5631 URL: https://issues.apache.org/jira/browse/SPARK-5631 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: Scala 2.10.4 Spark 1.2 CDH4.2 Reporter: DeepakVohra A Spark application generates the error Server IPC version 7 cannot communicate with client version 4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5135) Add support for describe [extended] table to DDL in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5135. Resolution: Fixed Add support for describe [extended] table to DDL in SQLContext -- Key: SPARK-5135 URL: https://issues.apache.org/jira/browse/SPARK-5135 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.3.0 Reporter: shengli Priority: Minor Fix For: 1.3.0 Original Estimate: 72h Remaining Estimate: 72h Support the Describe Table command: describe [extended] tableName. This also supports external datasource tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5615) Fix testPackage in StreamingContextSuite
Liang-Chi Hsieh created SPARK-5615: -- Summary: Fix testPackage in StreamingContextSuite Key: SPARK-5615 URL: https://issues.apache.org/jira/browse/SPARK-5615 Project: Spark Issue Type: Bug Reporter: Liang-Chi Hsieh Priority: Minor testPackage in StreamingContextSuite often throws SparkException because its ssc is not shut down gracefully. It does not affect the unit test, but I think we can make the shutdown graceful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
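For reference, a graceful StreamingContext shutdown, which waits for received data to be processed before stopping, looks roughly like this; whether this is exactly what the fix changes is an assumption:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("graceful-shutdown-sketch")
val ssc = new StreamingContext(conf, Seconds(1))
// ... define streams, ssc.start(), run the assertions under test ...
// Stop the underlying SparkContext too, and drain in-flight data first.
ssc.stop(stopSparkContext = true, stopGracefully = true)
{code}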
[jira] [Resolved] (SPARK-5608) Improve SEO of Spark documentation site to let Google find latest docs
[ https://issues.apache.org/jira/browse/SPARK-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-5608. -- Resolution: Fixed Fix Version/s: 1.3.0 Improve SEO of Spark documentation site to let Google find latest docs -- Key: SPARK-5608 URL: https://issues.apache.org/jira/browse/SPARK-5608 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Matei Zaharia Assignee: Matei Zaharia Fix For: 1.3.0 Google currently has trouble finding spark.apache.org/docs/latest, so a lot of the results returned for various queries are from random previous versions of Spark where someone created a link. I'd like to do the following: - Add a sitemap.xml to spark.apache.org that lists all the docs/latest pages (already done) - Add meta description tags on some of the most important doc pages - Shorten the titles of some pages to have more relevant keywords; for example there's no reason to have Spark SQL Programming Guide - Spark 1.2.0 documentation, we can just say Spark SQL - Spark 1.2.0 documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2808) update kafka to version 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307254#comment-14307254 ] koert kuipers commented on SPARK-2808: -- what is the motivation for this upgrade? update kafka to version 0.8.2 - Key: SPARK-2808 URL: https://issues.apache.org/jira/browse/SPARK-2808 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati First kafka_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2808) update kafka to version 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307254#comment-14307254 ] koert kuipers edited comment on SPARK-2808 at 2/5/15 2:28 PM: -- what is the motivation for this upgrade? the offset storage in kafka? was (Author: koert): what is the motivation for this upgrade? update kafka to version 0.8.2 - Key: SPARK-2808 URL: https://issues.apache.org/jira/browse/SPARK-2808 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati First kafka_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:
[ https://issues.apache.org/jira/browse/SPARK-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307289#comment-14307289 ] Takeshi Yamamuro commented on SPARK-5480: - This code didn't throw such exceptions in my environment. What are the predicate in subgraph() and the input graph? GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException: --- Key: SPARK-5480 URL: https://issues.apache.org/jira/browse/SPARK-5480 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Environment: Yarn client Reporter: Stephane Maarek Running the following code: val subgraph = graph.subgraph ( vpred = (id,article) => /* working predicate */ ).cache() println(s"Subgraph contains ${subgraph.vertices.count} nodes and ${subgraph.edges.count} edges") val prGraph = subgraph.staticPageRank(5).cache val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) { (v, title, rank) => (rank.getOrElse(0.0), title) } titleAndPrGraph.vertices.top(13) { Ordering.by((entry: (VertexId, (Double, _))) => entry._2._1) }.foreach(t => println(t._2._2._1 + ": " + t._2._1 + ", id:" + t._1)) Returns a graph with 5000 nodes and 4000 edges. Then it crashes during the PageRank with the following: 15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage 39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes) 15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage 39.0 (TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64) at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91) at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75) at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110) at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at
[jira] [Created] (SPARK-5632) not able to resolve dot('.') in field name
Lishu Liu created SPARK-5632: Summary: not able to resolve dot('.') in field name Key: SPARK-5632 URL: https://issues.apache.org/jira/browse/SPARK-5632 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: Spark cluster: EC2 m1.small + Spark 1.2.0 Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2 Reporter: Lishu Liu My cassandra table task_trace has a field sm.result which contains dot in the name. So SQL tried to look up sm instead of full name 'sm.result'. Here is my code: scala import org.apache.spark.sql.cassandra.CassandraSQLContext scala val cc = new CassandraSQLContext(sc) scala val task_trace = cc.jsonFile(/task_trace.json) scala task_trace.registerTempTable(task_trace) scala cc.setKeyspace(cerberus_data_v4) scala val res = cc.sql(SELECT received_datetime, task_body.cerberus_id, task_body.sm.result FROM task_trace WHERE task_id = 'fff7304e-9984-4b45-b10c-0423a96745ce') res: org.apache.spark.sql.SchemaRDD = SchemaRDD[57] at RDD at SchemaRDD.scala:108 == Query Plan == == Physical Plan == java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, cerberus_id, couponId, coupon_code, created, description, domain, expires, message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, validity The full schema look like this: scala task_trace.printSchema() root |-- received_datetime: long (nullable = true) |-- task_body: struct (nullable = true) ||-- cerberus_batch_id: string (nullable = true) ||-- cerberus_id: string (nullable = true) ||-- couponId: integer (nullable = true) ||-- coupon_code: string (nullable = true) ||-- created: string (nullable = true) ||-- description: string (nullable = true) ||-- domain: string (nullable = true) ||-- expires: string (nullable = true) ||-- message_id: string (nullable = true) ||-- neverShowAfter: string (nullable = true) ||-- neverShowBefore: string (nullable = true) ||-- offerTitle: string (nullable = true) ||-- screenshots: array (nullable = true) |||-- element: string (containsNull = false) ||-- sm.result: struct (nullable = true) |||-- cerberus_batch_id: string (nullable = true) |||-- cerberus_id: string (nullable = true) |||-- code: string (nullable = true) |||-- couponId: integer (nullable = true) |||-- created: string (nullable = true) |||-- description: string (nullable = true) |||-- domain: string (nullable = true) |||-- expires: string (nullable = true) |||-- message_id: string (nullable = true) |||-- neverShowAfter: string (nullable = true) |||-- neverShowBefore: string (nullable = true) |||-- offerTitle: string (nullable = true) |||-- result: struct (nullable = true) ||||-- post: struct (nullable = true) |||||-- alchemy_out_of_stock: struct (nullable = true) ||||||-- ci: double (nullable = true) ||||||-- value: boolean (nullable = true) |||||-- meta: struct (nullable = true) ||||||-- None_tx_value: array (nullable = true) |||||||-- element: string (containsNull = false) ||||||-- exceptions: array (nullable = true) |||||||-- element: string (containsNull = false) ||||||-- no_input_value: array (nullable = true) |||||||-- element: string (containsNull = false) ||||||-- not_mapped: array (nullable = true) |||||||-- element: string (containsNull = false) ||||||-- not_transformed: array (nullable = true) |||||||-- element: array (containsNull = false) ||||||||-- element: string (containsNull = false) |||||-- now_price_checkout: struct (nullable = true) ||||||-- ci: double (nullable = true) ||||||-- value: double 
(nullable = true) |||||-- shipping_price: struct (nullable = true) ||||||-- ci: double (nullable = true) ||||||-- value: double (nullable = true) |||||-- tax: struct (nullable = true) ||||||-- ci: double (nullable = true) ||||||-- value: double (nullable = true) |||||-- total: struct (nullable = true) ||||||-- ci: double (nullable = true) ||||||-- value: double (nullable = true) ||||-- pre: struct (nullable = true) |||||-- alchemy_out_of_stock: struct (nullable = true) |||
[jira] [Created] (SPARK-5633) pyspark saveAsTextFile support for compression codec
Vladimir Vladimirov created SPARK-5633: -- Summary: pyspark saveAsTextFile support for compression codec Key: SPARK-5633 URL: https://issues.apache.org/jira/browse/SPARK-5633 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.2.0 Reporter: Vladimir Vladimirov Priority: Minor Scala and Java API allows to provide compression codec with saveAsTextFile(path, codec) PySpark saveAsTextFile API does not support passing codec class. This story is about adding saveAsTextFile(path, codec) support into pyspark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
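For reference, the existing Scala API the issue wants mirrored in PySpark takes the codec class as a second argument:
{code}
import org.apache.hadoop.io.compress.GzipCodec

// Scala API: the codec class selects the compression applied to the output part files.
sc.parallelize(1 to 100).map(_.toString).saveAsTextFile("/tmp/out-gzip", classOf[GzipCodec])
{code}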
[jira] [Created] (SPARK-5634) History server shows misleading message when there are no incomplete apps
Marcelo Vanzin created SPARK-5634: - Summary: History server shows misleading message when there are no incomplete apps Key: SPARK-5634 URL: https://issues.apache.org/jira/browse/SPARK-5634 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Marcelo Vanzin Priority: Minor If you go to the history server, and click on Show incomplete applications, but there are no incomplete applications, you get a misleading message: {noformat} No completed applications found! Did you specify the correct logging directory? (etc etc) {noformat} That's the same message used when no complete applications are found; it should probably be tweaked for the incomplete apps case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5633) pyspark saveAsTextFile support for compression codec
[ https://issues.apache.org/jira/browse/SPARK-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308128#comment-14308128 ] Vladimir Vladimirov commented on SPARK-5633: Here is a workaround until the proposed functionality is accepted: {code}
from pyspark import SparkContext

def saveAsTextFileCompressed(t, path, codec="org.apache.hadoop.io.compress.GzipCodec"):
    # Encode every record as UTF-8 bytes, bypass the Python serializer, and
    # hand the underlying Java RDD the requested Hadoop compression codec.
    def func(split, iterator):
        for x in iterator:
            if not isinstance(x, basestring):
                x = unicode(x)
            if isinstance(x, unicode):
                x = x.encode("utf-8")
            yield x
    keyed = t.mapPartitionsWithIndex(func)
    keyed._bypass_serializer = True
    codecClass = SparkContext._jvm.java.lang.Class.forName(codec)
    keyed._jrdd.map(t.ctx._jvm.BytesToString()).saveAsTextFile(path, codecClass)
{code} pyspark saveAsTextFile support for compression codec Key: SPARK-5633 URL: https://issues.apache.org/jira/browse/SPARK-5633 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.2.0 Reporter: Vladimir Vladimirov Priority: Minor Scala and Java API allows to provide compression codec with saveAsTextFile(path, codec) PySpark saveAsTextFile API does not support passing codec class. This story is about adding saveAsTextFile(path, codec) support into pyspark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5622) Add connector/handler hive configuration settings to hive-thrift-server
[ https://issues.apache.org/jira/browse/SPARK-5622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308141#comment-14308141 ] Apache Spark commented on SPARK-5622: - User 'alexliu68' has created a pull request for this issue: https://github.com/apache/spark/pull/4406 Add connector/handler hive configuration settings to hive-thrift-server --- Key: SPARK-5622 URL: https://issues.apache.org/jira/browse/SPARK-5622 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0, 1.1.1 Reporter: Alex Liu When integrate Cassandra Storage handler to Spark SQL, we need pass some configuration settings to Hive-thrift-server hiveConf during server starting process. e.g. {code} ./sbin/start-thriftserver.sh --hiveconf cassandra.username=cassandra --hiveconf cassandra.password=cassandra {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308131#comment-14308131 ] Apache Spark commented on SPARK-5493: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/4405 Support proxy users under kerberos -- Key: SPARK-5493 URL: https://issues.apache.org/jira/browse/SPARK-5493 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Brock Noland When using kerberos, services may want to use spark-submit to submit jobs as a separate user. For example a service like oozie might want to submit jobs as a client user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5335) Destroying cluster in VPC with --delete-groups fails to remove security groups
[ https://issues.apache.org/jira/browse/SPARK-5335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308133#comment-14308133 ] Nicholas Chammas commented on SPARK-5335: - For the record: [AWS says|https://forums.aws.amazon.com/thread.jspa?messageID=572559] you must use the group ID (as opposed to the name) when deleting groups within a VPC. Destroying cluster in VPC with --delete-groups fails to remove security groups Key: SPARK-5335 URL: https://issues.apache.org/jira/browse/SPARK-5335 Project: Spark Issue Type: Bug Components: EC2 Reporter: Vladimir Grigor When I try to remove security groups using the --delete-groups option of the script, it fails because in a VPC security groups must be removed by ID, not by name as the script does now. {code} $ ./spark-ec2 -k key20141114 -i ~/key.pem --region=eu-west-1 --delete-groups destroy SparkByScript Are you sure you want to destroy the cluster SparkByScript? The following instances will be terminated: Searching for existing cluster SparkByScript... ALL DATA ON ALL NODES WILL BE LOST!! Destroy cluster SparkByScript (y/N): y Terminating master... Terminating slaves... Deleting security groups (this will take some time)... Waiting for cluster to enter 'terminated' state. Cluster is now in 'terminated' state. Waited 0 seconds. Attempt 1 Deleting rules in security group SparkByScript-slaves Deleting rules in security group SparkByScript-master ERROR:boto:400 Bad Request ERROR:boto:<?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'SparkByScript-slaves' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation.</Message></Error></Errors><RequestID>60313fac-5d47-48dd-a8bf-e9832948c0a6</RequestID></Response> Failed to delete security group SparkByScript-slaves ERROR:boto:400 Bad Request ERROR:boto:<?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'SparkByScript-master' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation.</Message></Error></Errors><RequestID>74ff8431-c0c1-4052-9ecb-c0adfa7eeeac</RequestID></Response> Failed to delete security group SparkByScript-master Attempt 2 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5604) Remove setCheckpointDir from LDA and tree Strategy
[ https://issues.apache.org/jira/browse/SPARK-5604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308166#comment-14308166 ] Apache Spark commented on SPARK-5604: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4407 Remove setCheckpointDir from LDA and tree Strategy -- Key: SPARK-5604 URL: https://issues.apache.org/jira/browse/SPARK-5604 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Continue the discussion from the LDA PR. CheckpoingDir is a global Spark configuration, which should not be altered by an ML algorithm. We could check whether checkpointDir is set if checkpointInterval is positive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5633) pyspark saveAsTextFile support for compression codec
[ https://issues.apache.org/jira/browse/SPARK-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308077#comment-14308077 ] Vladimir Vladimirov commented on SPARK-5633: Here is pull request that adds mentioned functionality https://github.com/apache/spark/pull/4403 pyspark saveAsTextFile support for compression codec Key: SPARK-5633 URL: https://issues.apache.org/jira/browse/SPARK-5633 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.2.0 Reporter: Vladimir Vladimirov Priority: Minor Scala and Java API allows to provide compression codec with saveAsTextFile(path, codec) PySpark saveAsTextFile API does not support passing codec class. This story is about adding saveAsTextFile(path, codec) support into pyspark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5620) Group methods in generated unidoc
[ https://issues.apache.org/jira/browse/SPARK-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308084#comment-14308084 ] Apache Spark commented on SPARK-5620: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4404 Group methods in generated unidoc - Key: SPARK-5620 URL: https://issues.apache.org/jira/browse/SPARK-5620 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Having methods show up in groups makes the doc more readable. For ML, we have many parameters and their setters/getters, it is necessary to group them. Same applies to the new DataFrame API. The grouping disappeared in recent versions of sbt-unidoc. We may miss some compiler options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5632) not able to resolve dot('.') in field name
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lishu Liu updated SPARK-5632: - Description: My cassandra table task_trace has a field sm.result which contains dot in the name. So SQL tried to look up sm instead of full name 'sm.result'. Here is my code: scala import org.apache.spark.sql.cassandra.CassandraSQLContext scala val cc = new CassandraSQLContext(sc) scala val task_trace = cc.jsonFile(/task_trace.json) scala task_trace.registerTempTable(task_trace) scala cc.setKeyspace(cerberus_data_v4) scala val res = cc.sql(SELECT received_datetime, task_body.cerberus_id, task_body.sm.result FROM task_trace WHERE task_id = 'fff7304e-9984-4b45-b10c-0423a96745ce') res: org.apache.spark.sql.SchemaRDD = SchemaRDD[57] at RDD at SchemaRDD.scala:108 == Query Plan == == Physical Plan == java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, cerberus_id, couponId, coupon_code, created, description, domain, expires, message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, validity The full schema look like this: scala task_trace.printSchema() root \|-- received_datetime: long (nullable = true) \|-- task_body: struct (nullable = true) \|\|-- cerberus_batch_id: string (nullable = true) \|\|-- cerberus_id: string (nullable = true) \|\|-- couponId: integer (nullable = true) \|\|-- coupon_code: string (nullable = true) \|\|-- created: string (nullable = true) \|\|-- description: string (nullable = true) \|\|-- domain: string (nullable = true) \|\|-- expires: string (nullable = true) \|\|-- message_id: string (nullable = true) \|\|-- neverShowAfter: string (nullable = true) \|\|-- neverShowBefore: string (nullable = true) \|\|-- offerTitle: string (nullable = true) \|\|-- screenshots: array (nullable = true) \|\|\|-- element: string (containsNull = false) \|\|-- sm.result: struct (nullable = true) \|\|\|-- cerberus_batch_id: string (nullable = true) \|\|\|-- cerberus_id: string (nullable = true) \|\|\|-- code: string (nullable = true) \|\|\|-- couponId: integer (nullable = true) \|\|\|-- created: string (nullable = true) \|\|\|-- description: string (nullable = true) \|\|\|-- domain: string (nullable = true) \|\|\|-- expires: string (nullable = true) \|\|\|-- message_id: string (nullable = true) \|\|\|-- neverShowAfter: string (nullable = true) \|\|\|-- neverShowBefore: string (nullable = true) \|\|\|-- offerTitle: string (nullable = true) \|\|\|-- result: struct (nullable = true) \|\|\|\|-- post: struct (nullable = true) \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true) \|\|\|\|\|\|-- ci: double (nullable = true) \|\|\|\|\|\|-- value: boolean (nullable = true) \|\|\|\|\|-- meta: struct (nullable = true) \|\|\|\|\|\|-- None_tx_value: array (nullable = true) \|\|\|\|\|\|\|-- element: string (containsNull = false) \|\|\|\|\|\|-- exceptions: array (nullable = true) \|\|\|\|\|\|\|-- element: string (containsNull = false) \|\|\|\|\|\|-- no_input_value: array (nullable = true) \|\|\|\|\|\|\|-- element: string (containsNull = false) \|\|\|\|\|\|-- not_mapped: array (nullable = true) \|\|\|\|\|\|\|-- element: string (containsNull = false) \|\|\|\|\|\|-- not_transformed: array (nullable = true) \|\|\|\|\|\|\|-- element: array (containsNull = false) \|\|\|\|\|\|\|\|-- element: string (containsNull = false) \|\|\|\|\|-- now_price_checkout: struct (nullable = true) \|\|\|\|\|\|-- ci: double (nullable = true) \|\|\|\|\|\|-- value: double 
(nullable = true) \|\|\|\|\|-- shipping_price: struct (nullable = true) \|\|\|\|\|\|-- ci: double (nullable = true) \|\|\|\|\|\|-- value: double (nullable = true) \|\|\|\|\|-- tax: struct (nullable = true) \|\|\|\|\|\|-- ci: double (nullable = true) \|\|\|\|\|\|-- value: double (nullable = true) \|\|\|\|\|-- total: struct (nullable = true) \|\|\|\|\|\|-- ci: double (nullable = true) \|\|\|\|\|\|-- value: double (nullable = true) \|\|\|\|-- pre: struct (nullable = true) \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true) \|\|\|\|\|\|-- ci: double
[jira] [Resolved] (SPARK-5528) Support schema merging while reading Parquet files
[ https://issues.apache.org/jira/browse/SPARK-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5528. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4308 [https://github.com/apache/spark/pull/4308] Support schema merging while reading Parquet files -- Key: SPARK-5528 URL: https://issues.apache.org/jira/browse/SPARK-5528 Project: Spark Issue Type: Improvement Reporter: Cheng Lian Fix For: 1.3.0 Spark 1.2.0 and prior versions only reads Parquet schema from {{_metadata}} or a random Parquet part-file, and assumes all part-files share exactly the same schema. In practice, it's common that users append new columns to existing Parquet schema. Parquet has native schema merging support for such scenarios. Spark SQL should also support this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308263#comment-14308263 ] Marcelo Vanzin commented on SPARK-5388: --- Also, a fun fact about the Jersey dependency. Here's an excerpt of the output of mvn dependency:tree for the yarn module: {noformat} [INFO] +- org.apache.hadoop:hadoop-yarn-common:jar:2.4.0:compile [INFO] | +- javax.xml.bind:jaxb-api:jar:2.2.2:compile [INFO] | | +- javax.xml.stream:stax-api:jar:1.0-2:compile [INFO] | | \- javax.activation:activation:jar:1.1:compile [INFO] | +- org.apache.commons:commons-compress:jar:1.4.1:compile [INFO] | | \- org.tukaani:xz:jar:1.0:compile [INFO] | +- commons-codec:commons-codec:jar:1.5:compile [INFO] | +- com.sun.jersey:jersey-core:jar:1.9:compile {noformat} Provide a stable application submission gateway in standalone cluster mode -- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: stable-spark-submit-in-standalone-mode-2-4-15.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. For more detail, please see the most recent design doc attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5509) EqualTo operator doesn't handle binary type properly
[ https://issues.apache.org/jira/browse/SPARK-5509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5509. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4308 [https://github.com/apache/spark/pull/4308] EqualTo operator doesn't handle binary type properly Key: SPARK-5509 URL: https://issues.apache.org/jira/browse/SPARK-5509 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.3.0, 1.2.1 Reporter: Cheng Lian Fix For: 1.3.0 Binary type is mapped to {{Array\[Byte\]}}, which can't be compared with {{==}} directly. However, {{EqualTo.eval()}} uses plain {{==}} to compare values. Run the following {{spark-shell}} snippet with Spark 1.2.0 to reproduce this issue: {code}
import org.apache.spark.sql.SQLContext
import sc._
val sqlContext = new SQLContext(sc)
import sqlContext._
case class KV(key: Int, value: Array[Byte])
def toBinary(s: String): Array[Byte] = s.toString.getBytes("UTF-8")
registerFunction("toBinary", toBinary _)
parallelize(1 to 1024).map(i => KV(i, toBinary(i.toString))).registerTempTable("bin")
// OK
sql("select * from bin where value > toBinary('100')").collect()
// Oops, returns nothing
sql("select * from bin where value = toBinary('100')").collect()
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
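The root cause the description points at is plain Scala/Java array semantics: {{==}} on {{Array\[Byte\]}} is reference equality, so two equal byte sequences never compare equal. A quick illustration:
{code}
val a = "100".getBytes("UTF-8")
val b = "100".getBytes("UTF-8")
a == b                        // false: arrays compare by reference
a.sameElements(b)             // true: element-wise comparison
java.util.Arrays.equals(a, b) // true
{code}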
[jira] [Created] (SPARK-5635) Allow users to run .scala files directly from spark-submit
Grant Henke created SPARK-5635: -- Summary: Allow users to run .scala files directly from spark-submit Key: SPARK-5635 URL: https://issues.apache.org/jira/browse/SPARK-5635 Project: Spark Issue Type: New Feature Components: Spark Core, Spark Shell Reporter: Grant Henke Priority: Minor Similar to the python functionality allow users to submit .scala files. Currently the way I simulate this is to use spark-shell and run: `spark-shell -i myscript.scala` Note: user needs to add exit to the bottom of the script. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5637) Expose spark_ec2 as a StarCluster Plugin
Alex Rothberg created SPARK-5637: Summary: Expose spark_ec2 as a StarCluster Plugin Key: SPARK-5637 URL: https://issues.apache.org/jira/browse/SPARK-5637 Project: Spark Issue Type: Improvement Reporter: Alex Rothberg Priority: Minor StarCluster has a lot of features in place for starting EC2 instances, and it would be great to have an option to leverage that as a plugin. See: http://star.mit.edu/cluster/docs/latest/index.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5638) Add a config flag to disable eager analysis of DataFrames
Reynold Xin created SPARK-5638: -- Summary: Add a config flag to disable eager analysis of DataFrames Key: SPARK-5638 URL: https://issues.apache.org/jira/browse/SPARK-5638 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Since DataFrames are eagerly analyzed, there is no way to construct a DataFrame that is invalid anymore (which can be very useful for debugging invalid queries). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5638) Add a config flag to disable eager analysis of DataFrames
[ https://issues.apache.org/jira/browse/SPARK-5638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308286#comment-14308286 ] Apache Spark commented on SPARK-5638: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4408 Add a config flag to disable eager analysis of DataFrames - Key: SPARK-5638 URL: https://issues.apache.org/jira/browse/SPARK-5638 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Since DataFrames are eagerly analyzed, there is no longer a way to construct an invalid DataFrame (which can be very useful for debugging invalid queries). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5639) Support DataFrame.renameColumn
Reynold Xin created SPARK-5639: -- Summary: Support DataFrame.renameColumn Key: SPARK-5639 URL: https://issues.apache.org/jira/browse/SPARK-5639 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin It is incredibly hard to rename a column using the existing DSL. Let's provide that out of the box. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
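To make the "incredibly hard" part concrete: with the current DSL a rename means re-selecting every column and aliasing the one being changed. A rough sketch with made-up column names:
{code}
// Rename "name" to "full_name" today: every other column must be listed explicitly.
val renamed = df.select(
  df("name").as("full_name"),
  df("age"),
  df("city"))

// The requested API would reduce this to a single call along the lines of
// df.renameColumn("name", "full_name")
{code}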
[jira] [Commented] (SPARK-3454) Expose JSON representation of data shown in WebUI
[ https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308324#comment-14308324 ] Marcelo Vanzin commented on SPARK-3454: --- Hi [~imranr], There are two ways I can see to solve the routing problem: The first is the one you mention. I like it because, as you say, it keeps the API consistent across different UIs. You always look at /json, not some subpath that depends on which daemon you're looking at. The second is to remove the notion of an application list from this spec. That means the /json tree would be mounted under the application's path, not at the root of the web server. The downside is that when you add an API to list applications to the master / history server, things will look weird (you have /json/v1/applications and /{applicationId}/json/v1 instead of a single tree). Clients would have to adapt depending on whether they're talking to an app directly, or to the master / history server. So yeah, I like your suggestion better. Expose JSON representation of data shown in WebUI - Key: SPARK-3454 URL: https://issues.apache.org/jira/browse/SPARK-3454 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.1.0 Reporter: Kousuke Saruta Attachments: sparkmonitoringjsondesign.pdf If WebUI support to JSON format extracting, it's helpful for user who want to analyse stage / task / executor information. Fortunately, WebUI has renderJson method so we can implement the method in each subclass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5639) Support DataFrame.renameColumn
[ https://issues.apache.org/jira/browse/SPARK-5639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308328#comment-14308328 ] Apache Spark commented on SPARK-5639: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4410 Support DataFrame.renameColumn -- Key: SPARK-5639 URL: https://issues.apache.org/jira/browse/SPARK-5639 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin It is incredibly hard to rename a column using the existing DSL. Let's provide that out of the box. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5620) Group methods in generated unidoc
[ https://issues.apache.org/jira/browse/SPARK-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5620. Resolution: Fixed Fix Version/s: 1.3.0 Group methods in generated unidoc - Key: SPARK-5620 URL: https://issues.apache.org/jira/browse/SPARK-5620 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.3.0 Having methods show up in groups makes the doc more readable. For ML, we have many parameters and their setters/getters, so it is necessary to group them. The same applies to the new DataFrame API. The grouping disappeared in recent versions of sbt-unidoc; we may be missing some compiler options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
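For reference, the grouping relies on scaladoc group tags in the sources plus the {{-groups}} scaladoc option; the snippet below is only a sketch of that mechanism with invented names, not the project's actual build wiring:
{code}
/**
 * @groupname param Parameters
 * @groupname getParam Parameter getters
 */
abstract class ExampleEstimator {
  /** Sets the regularization parameter. @group param */
  def setRegParam(value: Double): this.type = this

  /** Gets the regularization parameter. @group getParam */
  def getRegParam: Double = 0.0
}

// In the sbt build, the doc task must be passed the option that enables
// grouping, e.g. (exact location assumed):
//   scalacOptions in (ScalaUnidoc, unidoc) += "-groups"
{code}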
[jira] [Resolved] (SPARK-5604) Remove setCheckpointDir from LDA and tree Strategy
[ https://issues.apache.org/jira/browse/SPARK-5604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5604. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4390 [https://github.com/apache/spark/pull/4390] Remove setCheckpointDir from LDA and tree Strategy -- Key: SPARK-5604 URL: https://issues.apache.org/jira/browse/SPARK-5604 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.3.0 Continue the discussion from the LDA PR. CheckpointDir is a global Spark configuration, which should not be altered by an ML algorithm. We could check whether checkpointDir is set if checkpointInterval is positive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
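A minimal sketch of the check suggested in the description, assuming it lives in the algorithm's parameter validation; the method name is illustrative:
{code}
import org.apache.spark.SparkContext

// Fail fast if periodic checkpointing is requested but no checkpoint
// directory has been configured on the SparkContext.
def validateCheckpointing(sc: SparkContext, checkpointInterval: Int): Unit = {
  require(checkpointInterval <= 0 || sc.getCheckpointDir.isDefined,
    "checkpointInterval is positive, but no checkpoint directory is set; " +
      "call SparkContext.setCheckpointDir first")
}
{code}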
[jira] [Resolved] (SPARK-5182) Partitioning support for tables created by the data source API
[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5182. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4308 [https://github.com/apache/spark/pull/4308] Partitioning support for tables created by the data source API -- Key: SPARK-5182 URL: https://issues.apache.org/jira/browse/SPARK-5182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet
[ https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3575. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4308 [https://github.com/apache/spark/pull/4308] Hive Schema is ignored when using convertMetastoreParquet - Key: SPARK-3575 URL: https://issues.apache.org/jira/browse/SPARK-3575 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.3.0 This can cause problems when for example one of the columns is defined as TINYINT. A class cast exception will be thrown since the parquet table scan produces INTs while the rest of the execution is expecting bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5624) Can't find new column
[ https://issues.apache.org/jira/browse/SPARK-5624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308261#comment-14308261 ] Alex Liu commented on SPARK-5624: - Test it on the latest master branch it doesn't have this issue. Can't find new column -- Key: SPARK-5624 URL: https://issues.apache.org/jira/browse/SPARK-5624 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1 Reporter: Alex Liu Priority: Minor The following test fails {code} 0: jdbc:hive2://localhost:1 DROP TABLE IF EXISTS alter_test_table; +-+ | Result | +-+ +-+ No rows selected (0.175 seconds) 0: jdbc:hive2://localhost:1 DROP TABLE IF EXISTS alter_test_table_ctas; +-+ | Result | +-+ +-+ No rows selected (0.155 seconds) 0: jdbc:hive2://localhost:1 DROP TABLE IF EXISTS alter_test_table_renamed; +-+ | Result | +-+ +-+ No rows selected (0.162 seconds) 0: jdbc:hive2://localhost:1 CREATE TABLE alter_test_table (foo INT, bar STRING) COMMENT 'table to test DDL ops' PARTITIONED BY (ds STRING) STORED AS TEXTFILE; +-+ | result | +-+ +-+ No rows selected (0.247 seconds) 0: jdbc:hive2://localhost:1 LOAD DATA LOCAL INPATH '/Users/alex/project/automaton/resources/tests/data/files/kv1.txt' OVERWRITE INTO TABLE alter_test_table PARTITION (ds='2008-08-08'); +-+ | result | +-+ +-+ No rows selected (0.367 seconds) 0: jdbc:hive2://localhost:1 CREATE TABLE alter_test_table_ctas as SELECT * FROM alter_test_table; +--+--+-+ | foo | bar | ds | +--+--+-+ +--+--+-+ No rows selected (0.641 seconds) 0: jdbc:hive2://localhost:1 ALTER TABLE alter_test_table ADD COLUMNS (new_col1 INT); +-+ | result | +-+ +-+ No rows selected (0.226 seconds) 0: jdbc:hive2://localhost:1 INSERT OVERWRITE TABLE alter_test_table PARTITION (ds='2008-08-15') SELECT foo, bar, 3 FROM alter_test_table_ctas WHERE ds='2008-08-08'; +--+--+--+ | foo | bar | c_2 | +--+--+--+ +--+--+--+ No rows selected (0.522 seconds) 0: jdbc:hive2://localhost:1 select * from alter_test_table ; Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 35.0 failed 4 times, most recent failure: Lost task 0.3 in stage 35.0 (TID 66, 127.0.0.1): java.lang.RuntimeException: cannot find field new_col1 from [0:foo, 1:bar] org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:367) org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldRef(LazySimpleStructObjectInspector.java:168) org.apache.spark.sql.hive.HadoopTableReader$$anonfun$9.apply(TableReader.scala:275) org.apache.spark.sql.hive.HadoopTableReader$$anonfun$9.apply(TableReader.scala:275) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$class.map(TraversableLike.scala:244) scala.collection.AbstractTraversable.map(Traversable.scala:105) org.apache.spark.sql.hive.HadoopTableReader$.fillObject(TableReader.scala:275) org.apache.spark.sql.hive.HadoopTableReader$$anonfun$3$$anonfun$apply$1.apply(TableReader.scala:193) org.apache.spark.sql.hive.HadoopTableReader$$anonfun$3$$anonfun$apply$1.apply(TableReader.scala:187) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308260#comment-14308260 ] Marcelo Vanzin commented on SPARK-5388: --- Hi [~andrewor14], Thanks for updating the spec! This one looks much, much better. I think most of my concerns have been addressed. Adherence to RESTfulness is not super important since this is an internal API, although I really would suggest picking a better name for the Scala package (e.g. org.apache.spark.deploy.proto or something, instead of rest). A few questions: - is the action field required? Since you have different URIs handling different messages, it seems redundant now. And responses having an action is kinda weird. - what is the protocolVersion field in ErrorResponse? From the spec, it sounds like the maximum protocol version supported by the server. If that's the case, can the property be renamed to maxProtocolVersion? - the message definitions use strings for all data, is that intentional? It would feel more natural to have proper types, e.g.: jars : [ one.jar, two.jar ], driverCores: 8, superviseDriver: false. - The spec says the server should report unknown fields back to the client. There's nothing in the response type that supports that; is the server expected to embed that information in the message field? Feels like it would be better to have an explicit field for that. - Is the kill endpoint protected in any way? Right now it seems like anyone can post to that and kill a driver, if they know (or guess) the submission ID. If there's no special protection, I'd say in the spec that the submission ID should be, at least, cryptographically secure. At that point, as long as the server has SSL enabled, it should be hard enough to kill a random driver. Provide a stable application submission gateway in standalone cluster mode -- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: stable-spark-submit-in-standalone-mode-2-4-15.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. For more detail, please see the most recent design doc attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
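To illustrate the "proper types" point above, here is what a typed submission message could look like instead of the all-strings encoding; the class and field names follow the examples in the comment and are not the spec's final schema:
{code}
// Illustrative shape only: lists, numbers and booleans instead of strings.
case class CreateSubmissionRequest(
  appResource: String,
  mainClass: String,
  jars: Seq[String] = Seq("one.jar", "two.jar"),
  driverCores: Int = 8,
  superviseDriver: Boolean = false)
{code}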
[jira] [Created] (SPARK-5636) Lower dynamic allocation add interval
Andrew Or created SPARK-5636: Summary: Lower dynamic allocation add interval Key: SPARK-5636 URL: https://issues.apache.org/jira/browse/SPARK-5636 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or The current default of 1 min is a little long, especially since a recent patch causes the number of executors to start at 0 by default. We should ramp up much more quickly in the beginning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5557) spark-shell failed to start
[ https://issues.apache.org/jira/browse/SPARK-5557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308346#comment-14308346 ] Patrick Wendell commented on SPARK-5557: I can send a fix for this shortly. It also works fine if you build with Hadoop 2 support. spark-shell failed to start --- Key: SPARK-5557 URL: https://issues.apache.org/jira/browse/SPARK-5557 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Guoqiang Li Priority: Blocker the log: {noformat} 5/02/03 19:06:39 INFO spark.HttpServer: Starting HTTP Server Exception in thread main java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse at org.apache.spark.HttpServer.org$apache$spark$HttpServer$$doStart(HttpServer.scala:75) at org.apache.spark.HttpServer$$anonfun$1.apply(HttpServer.scala:62) at org.apache.spark.HttpServer$$anonfun$1.apply(HttpServer.scala:62) at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1774) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1765) at org.apache.spark.HttpServer.start(HttpServer.scala:62) at org.apache.spark.repl.SparkIMain.init(SparkIMain.scala:130) at org.apache.spark.repl.SparkILoop$SparkILoopInterpreter.init(SparkILoop.scala:185) at org.apache.spark.repl.SparkILoop.createInterpreter(SparkILoop.scala:214) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:946) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:942) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:942) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:942) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1039) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:403) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:77) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: javax.servlet.http.HttpServletResponse at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 25 more {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5621) Cannot fetch dependencies for mllib
Luca Venturini created SPARK-5621: - Summary: Cannot fetch dependencies for mllib Key: SPARK-5621 URL: https://issues.apache.org/jira/browse/SPARK-5621 Project: Spark Issue Type: Bug Reporter: Luca Venturini The mllib docs say to include com.github.fommil.netlib:all:1.1.2, but I cannot fetch any jar for this dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307580#comment-14307580 ] Philippe Girolami commented on SPARK-1867: -- Has anyone figured this out ? I'm seeing this happen when running spark-shell off the master branch (at cd5da42), using the same example as [~ansonism]. Works fine in 1.2.0, downloaded from the website. {code} val source = sc.textFile(/tmp/testfile.txt) source.saveAsTextFile(/tmp/test_spark_output) {code} I built master using {code} mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -Pbigtop-dist -DskipTests clean package install {code} on MacOS using Sun Java 7 {quote} java version 1.7.0_60 Java(TM) SE Runtime Environment (build 1.7.0_60-b19) Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode) {quote} Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. 
The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new SparkConf() .setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println(count = + new SparkContext(conf).textFile(someHdfsPath).count()) My SBT dependencies: // relevant org.apache.spark % spark-core_2.10 % 0.9.1, org.apache.hadoop % hadoop-client % 2.3.0-mr1-cdh5.0.0, // standard, probably unrelated com.github.seratch %% awscala
[jira] [Commented] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307584#comment-14307584 ] Kostas Sakellis commented on SPARK-5081: Can you add a sample of the code too? Shuffle write increases --- Key: SPARK-5081 URL: https://issues.apache.org/jira/browse/SPARK-5081 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Kevin Jung The size of shuffle write showing in spark web UI is much different when I execute same spark job with same input data in both spark 1.1 and spark 1.2. At sortBy stage, the size of shuffle write is 98.1MB in spark 1.1 but 146.9MB in spark 1.2. I set spark.shuffle.manager option to hash because it's default value is changed but spark 1.2 still writes shuffle output more than spark 1.1. It can increase disk I/O overhead exponentially as the input file gets bigger and it causes the jobs take more time to complete. In the case of about 100GB input, for example, the size of shuffle write is 39.7GB in spark 1.1 but 91.0GB in spark 1.2. spark 1.1 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write|| |9|saveAsTextFile| |1169.4KB| | |12|combineByKey| |1265.4KB|1275.0KB| |6|sortByKey| |1276.5KB| | |8|mapPartitions| |91.0MB|1383.1KB| |4|apply| |89.4MB| | |5|sortBy|155.6MB| |98.1MB| |3|sortBy|155.6MB| | | |1|collect| |2.1MB| | |2|mapValues|155.6MB| |2.2MB| |0|first|184.4KB| | | spark 1.2 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write|| |12|saveAsTextFile| |1170.2KB| | |11|combineByKey| |1264.5KB|1275.0KB| |8|sortByKey| |1273.6KB| | |7|mapPartitions| |134.5MB|1383.1KB| |5|zipWithIndex| |132.5MB| | |4|sortBy|155.6MB| |146.9MB| |3|sortBy|155.6MB| | | |2|collect| |2.0MB| | |1|mapValues|155.6MB| |2.2MB| |0|first|184.4KB| | | -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5622) Add connector/handler hive configuration settings to hive-thrift-server
Alex Liu created SPARK-5622: --- Summary: Add connector/handler hive configuration settings to hive-thrift-server Key: SPARK-5622 URL: https://issues.apache.org/jira/browse/SPARK-5622 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1, 1.1.0 Reporter: Alex Liu When integrating the Cassandra storage handler with Spark SQL, we need to pass some configuration settings to the hive-thrift-server hiveConf during the server start-up process. e.g. {code} ./sbin/start-thriftserver.sh --hiveconf cassandra.username=cassandra --hiveconf cassandra.password=cassandra {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5623) Replace an obsolete mapReduceTriplets with a new aggregateMessages in GraphSuite
Takeshi Yamamuro created SPARK-5623: --- Summary: Replace an obsolete mapReduceTriplets with a new aggregateMessages in GraphSuite Key: SPARK-5623 URL: https://issues.apache.org/jira/browse/SPARK-5623 Project: Spark Issue Type: Test Components: GraphX Reporter: Takeshi Yamamuro -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
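For a sense of the migration (this is not the GraphSuite change itself): the same aggregation expressed with the deprecated operator and the new one, here counting in-degrees.
{code}
import org.apache.spark.graphx._

// Old, deprecated API: emit (dstId, 1) per edge and sum per vertex.
def inDegreesOld(graph: Graph[Int, Int]): VertexRDD[Int] =
  graph.mapReduceTriplets[Int](et => Iterator((et.dstId, 1)), _ + _)

// New API: the same computation via aggregateMessages.
def inDegreesNew(graph: Graph[Int, Int]): VertexRDD[Int] =
  graph.aggregateMessages[Int](ctx => ctx.sendToDst(1), _ + _)
{code}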
[jira] [Commented] (SPARK-5013) User guide for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307605#comment-14307605 ] Apache Spark commented on SPARK-5013: - User 'tgaloppo' has created a pull request for this issue: https://github.com/apache/spark/pull/4401 User guide for Gaussian Mixture Model - Key: SPARK-5013 URL: https://issues.apache.org/jira/browse/SPARK-5013 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Travis Galoppo Add GMM user guide with code examples in Scala/Java/Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
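Roughly the kind of Scala example the guide would contain; the input path and k below are placeholders:
{code}
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated numeric vectors from a text file.
val data = sc.textFile("data/mllib/gmm_data.txt")
  .map(line => Vectors.dense(line.trim.split(' ').map(_.toDouble)))
  .cache()

// Fit a mixture of 2 Gaussians with EM.
val model = new GaussianMixture().setK(2).run(data)

model.gaussians.zipWithIndex.foreach { case (g, i) =>
  println(s"weight=${model.weights(i)} mu=${g.mu} sigma=\n${g.sigma}")
}
{code}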
[jira] [Commented] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:
[ https://issues.apache.org/jira/browse/SPARK-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307628#comment-14307628 ] Stephane Maarek commented on SPARK-5480: It happened once after one of my server failed, but the graph vertices and edges count did work. Doesn't happen systematically... having issues reproducing it val subgraph = graph.subgraph ( vpred = (id,article) = article._1.toLowerCase.contains(stringToSearchFor) || article._3.exists(keyword = keyword.contains(stringToSearchFor)) || (article._2 match { case None = false case Some(articleAbstract) = articleAbstract.toLowerCase.contains(stringToSearchFor) }) ).cache() GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException: --- Key: SPARK-5480 URL: https://issues.apache.org/jira/browse/SPARK-5480 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Environment: Yarn client Reporter: Stephane Maarek Running the following code: val subgraph = graph.subgraph ( vpred = (id,article) = //working predicate) ).cache() println( sSubgraph contains ${subgraph.vertices.count} nodes and ${subgraph.edges.count} edges) val prGraph = subgraph.staticPageRank(5).cache val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) { (v, title, rank) = (rank.getOrElse(0.0), title) } titleAndPrGraph.vertices.top(13) { Ordering.by((entry: (VertexId, (Double, _))) = entry._2._1) }.foreach(t = println(t._2._2._1 + : + t._2._1 + , id: + t._1)) Returns a graph with 5000 nodes and 4000 edges. Then it crashes during the PageRank with the following: 15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage 39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes) 15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage 39.0 (TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64) at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91) at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75) at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110) at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at
[jira] [Resolved] (SPARK-5621) Cannot fetch dependencies for mllib
[ https://issues.apache.org/jira/browse/SPARK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5621. -- Resolution: Not a Problem It certainly exists: http://search.maven.org/#artifactdetails%7Ccom.github.fommil.netlib%7Call%7C1.1.2%7Cpom The docs actually suggest using the {{netlib-lgpl}} profile, and if you have a look at both of these you'll see that it's a pom-only artifact, so you need {{<type>pom</type>}}. Cannot fetch dependencies for mllib --- Key: SPARK-5621 URL: https://issues.apache.org/jira/browse/SPARK-5621 Project: Spark Issue Type: Bug Reporter: Luca Venturini The mllib docs say to include com.github.fommil.netlib:all:1.1.2, but I cannot fetch any jar for this dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
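Concretely, a dependency along these lines should resolve (sbt shown as a sketch; the key detail is depending on the artifact as a POM, since it publishes no jar of its own, and the {{netlib-lgpl}} build profile remains the documented route):
{code}
// build.sbt -- direct-dependency alternative to the netlib-lgpl profile.
libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
{code}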
[jira] [Commented] (SPARK-5610) Generate Java docs without package private classes and methods
[ https://issues.apache.org/jira/browse/SPARK-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307599#comment-14307599 ] Sean Owen commented on SPARK-5610: -- From looking at the Javadoc 8 + unidoc issue, I recall that we had a problem where {{private[foo]}} classes were being rendered as private top-level Java classes, which isn't legal to Javadoc 8. This bit of code you change is the bit that fixed that particular problem. Is this going to cause such classes to be private again? That feels a bit wrong, since such classes aren't really meaningful in Java anyway. You can certainly tell javadoc to only generate docs for public / protected classes. In fact that should be the default. So I wonder if the right-er change is to render such classes as package-private in Java? It doesn't mean quite the same thing but may be entirely close enough for genjavadoc purposes. Generate Java docs without package private classes and methods -- Key: SPARK-5610 URL: https://issues.apache.org/jira/browse/SPARK-5610 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Xiangrui Meng Assignee: Xiangrui Meng The current generated Java doc is a mix of public and package-private classes and methods. We can update genjavadoc to hide them. Upstream PR: https://github.com/typesafehub/genjavadoc/pull/47 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1867. -- Resolution: Not a Problem I think there are a number of manifestations of the same basic problem here: mismatching versions of Spark. In each case it seems like bits and pieces of Hadoop and Spark have been built into an app, or the cluster's version of Spark was not matched with the Hadoop version. I am not 100% sure since there is a load of stuff being talked about here, but I do not see a clear problem in Spark or actionable change. Philippe I imagine you are reporting something different: SPARK-5557 Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. 
The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new SparkConf() .setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println(count = + new SparkContext(conf).textFile(someHdfsPath).count()) My SBT dependencies: // relevant org.apache.spark % spark-core_2.10 % 0.9.1, org.apache.hadoop % hadoop-client % 2.3.0-mr1-cdh5.0.0, // standard, probably unrelated com.github.seratch %% awscala % [0.2,), org.scalacheck %% scalacheck % 1.10.1 % test, org.specs2 %% specs2 % 1.14 % test, org.scala-lang % scala-reflect % 2.10.3, org.scalaz %% scalaz-core % 7.0.5, net.minidev % json-smart %
[jira] [Commented] (SPARK-2827) Add DegreeDist function support
[ https://issues.apache.org/jira/browse/SPARK-2827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307499#comment-14307499 ] Apache Spark commented on SPARK-2827: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/4399 Add DegreeDist function support --- Key: SPARK-2827 URL: https://issues.apache.org/jira/browse/SPARK-2827 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Lu Lu Add degree distribution operators in GraphOps for GraphX. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
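For a sense of what such an operator computes, a sketch built from existing GraphOps (not the patch itself): the number of vertices at each degree.
{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Map each vertex to its degree, then count vertices per degree value.
def degreeDistribution[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): RDD[(Int, Long)] =
  graph.degrees.map { case (_, deg) => (deg, 1L) }.reduceByKey(_ + _)
{code}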
[jira] [Created] (SPARK-5620) Group methods in generated unidoc
Xiangrui Meng created SPARK-5620: Summary: Group methods in generated unidoc Key: SPARK-5620 URL: https://issues.apache.org/jira/browse/SPARK-5620 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Having methods show up in groups makes the doc more readable. For ML, we have many parameters and their setters/getters, so it is necessary to group them. The same applies to the new DataFrame API. The grouping disappeared in recent versions of sbt-unidoc; we may be missing some compiler options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4897) Python 3 support
[ https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307482#comment-14307482 ] Josh Rosen commented on SPARK-4897: --- By the way, it might be nice to see if we can figure out a good way of subdividing this task across multiple PRs so that the pieces that we have already figured out don't end up bitrotting / becoming merge-conflicts. For instance, if we can test the `cloudpickle.py` file separately from the other modules, then we could submit a PR that only adds 3.4 support to that file. If you can spot any other natural subproblems here, leave a comment or create a sub-task on this JIRA ticket. Python 3 support Key: SPARK-4897 URL: https://issues.apache.org/jira/browse/SPARK-4897 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Josh Rosen Priority: Minor It would be nice to have Python 3 support in PySpark, provided that we can do it in a way that maintains backwards-compatibility with Python 2.6. I started looking into porting this; my WIP work can be found at https://github.com/JoshRosen/spark/compare/python3 I was able to use the [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] tool to handle the basic conversion of things like {{print}} statements, etc. and had to manually fix up a few imports for packages that moved / were renamed, but the major blocker that I hit was {{cloudpickle}}: {code} [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin Type help, copyright, credits or license for more information. Traceback (most recent call last): File /Users/joshrosen/Documents/Spark/python/pyspark/shell.py, line 28, in module import pyspark File /Users/joshrosen/Documents/spark/python/pyspark/__init__.py, line 41, in module from pyspark.context import SparkContext File /Users/joshrosen/Documents/spark/python/pyspark/context.py, line 26, in module from pyspark import accumulators File /Users/joshrosen/Documents/spark/python/pyspark/accumulators.py, line 97, in module from pyspark.cloudpickle import CloudPickler File /Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py, line 120, in module class CloudPickler(pickle.Pickler): File /Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py, line 122, in CloudPickler dispatch = pickle.Pickler.dispatch.copy() AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch' {code} This code looks like it will be hard difficult to port to Python 3, so this might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] for Python serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5557) Servlet API classes now missing after jetty shading
[ https://issues.apache.org/jira/browse/SPARK-5557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308374#comment-14308374 ] Kostas Sakellis commented on SPARK-5557: [~pwendell] recommended this which did the trick: {code} diff --git a/core/pom.xml b/core/pom.xml index 2dc5f74..f03ec47 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -132,6 +132,11 @@ artifactIdjetty-servlet/artifactId scopecompile/scope /dependency +dependency + groupIdorg.eclipse.jetty.orbit/groupId + artifactIdjavax.servlet/artifactId + version3.0.0.v201112011016/version +/dependency dependency groupIdorg.apache.commons/groupId {code} Servlet API classes now missing after jetty shading --- Key: SPARK-5557 URL: https://issues.apache.org/jira/browse/SPARK-5557 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Guoqiang Li Priority: Blocker the log: {noformat} 5/02/03 19:06:39 INFO spark.HttpServer: Starting HTTP Server Exception in thread main java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse at org.apache.spark.HttpServer.org$apache$spark$HttpServer$$doStart(HttpServer.scala:75) at org.apache.spark.HttpServer$$anonfun$1.apply(HttpServer.scala:62) at org.apache.spark.HttpServer$$anonfun$1.apply(HttpServer.scala:62) at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1774) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1765) at org.apache.spark.HttpServer.start(HttpServer.scala:62) at org.apache.spark.repl.SparkIMain.init(SparkIMain.scala:130) at org.apache.spark.repl.SparkILoop$SparkILoopInterpreter.init(SparkILoop.scala:185) at org.apache.spark.repl.SparkILoop.createInterpreter(SparkILoop.scala:214) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:946) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:942) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:942) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:942) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1039) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:403) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:77) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: javax.servlet.http.HttpServletResponse at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 
25 more {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5335) Destroying cluster in VPC with --delete-groups fails to remove security groups
[ https://issues.apache.org/jira/browse/SPARK-5335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308133#comment-14308133 ] Nicholas Chammas edited comment on SPARK-5335 at 2/6/15 1:15 AM: - For the record: [AWS says|https://forums.aws.amazon.com/thread.jspa?messageID=572559] you must use the group ID (as opposed to the name) when deleting groups within a VPC. That appears to be the root cause of the issue reported here. was (Author: nchammas): For the record: [AWS says|https://forums.aws.amazon.com/thread.jspa?messageID=572559] you must use the group ID (as opposed to the name) when deleting groups within a VPC. Destroying cluster in VPC with --delete-groups fails to remove security groups Key: SPARK-5335 URL: https://issues.apache.org/jira/browse/SPARK-5335 Project: Spark Issue Type: Bug Components: EC2 Reporter: Vladimir Grigor When I try to remove security groups using option of the script, it fails because in VPC one should remove security groups by id, not name as it is now. {code} $ ./spark-ec2 -k key20141114 -i ~/key.pem --region=eu-west-1 --delete-groups destroy SparkByScript Are you sure you want to destroy the cluster SparkByScript? The following instances will be terminated: Searching for existing cluster SparkByScript... ALL DATA ON ALL NODES WILL BE LOST!! Destroy cluster SparkByScript (y/N): y Terminating master... Terminating slaves... Deleting security groups (this will take some time)... Waiting for cluster to enter 'terminated' state. Cluster is now in 'terminated' state. Waited 0 seconds. Attempt 1 Deleting rules in security group SparkByScript-slaves Deleting rules in security group SparkByScript-master ERROR:boto:400 Bad Request ERROR:boto:?xml version=1.0 encoding=UTF-8? ResponseErrorsErrorCodeInvalidParameterValue/CodeMessageInvalid value 'SparkByScript-slaves' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation./Message/Error/ErrorsRequestID60313fac-5d47-48dd-a8bf-e9832948c0a6/RequestID/Response Failed to delete security group SparkByScript-slaves ERROR:boto:400 Bad Request ERROR:boto:?xml version=1.0 encoding=UTF-8? ResponseErrorsErrorCodeInvalidParameterValue/CodeMessageInvalid value 'SparkByScript-master' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation./Message/Error/ErrorsRequestID74ff8431-c0c1-4052-9ecb-c0adfa7eeeac/RequestID/Response Failed to delete security group SparkByScript-master Attempt 2 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org