[jira] [Created] (SPARK-2714) DAGScheduler logs jobid when runJob finishes
YanTang Zhai created SPARK-2714: --- Summary: DAGScheduler logs jobid when runJob finishes Key: SPARK-2714 URL: https://issues.apache.org/jira/browse/SPARK-2714 Project: Spark Issue Type: Documentation Components: Spark Core Reporter: YanTang Zhai Priority: Minor DAGScheduler logs jobid when runJob finishes -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2613) CLONE - word2vec: Distributed Representation of Words
[ https://issues.apache.org/jira/browse/SPARK-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-2613. Assignee: Xiangrui Meng (was: Liquan Pei) CLONE - word2vec: Distributed Representation of Words - Key: SPARK-2613 URL: https://issues.apache.org/jira/browse/SPARK-2613 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yifan Yang Assignee: Xiangrui Meng Original Estimate: 672h Remaining Estimate: 672h We would like to add parallel implementation of word2vec to MLlib. word2vec finds distributed representation of words through training of large data sets. The Spark programming model fits nicely with word2vec as the training algorithm of word2vec is embarrassingly parallel. We will focus on skip-gram model and negative sampling in our initial implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2510) word2vec: Distributed Representation of Words
[ https://issues.apache.org/jira/browse/SPARK-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075951#comment-14075951 ] Xiangrui Meng commented on SPARK-2510: -- Had an offline discussion with [~liquanpei] and checked the C implementation of word2vec. It is not embarrassingly parallel because it frequently updates the global vectors, which is okay for multithreading but bad for a distributed setting. We are thinking about making stochastic updates within each partition and then merging the vectors. Averaging works for SGD, but I doubt whether it would work here. More to investigate. word2vec: Distributed Representation of Words - Key: SPARK-2510 URL: https://issues.apache.org/jira/browse/SPARK-2510 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Liquan Pei Assignee: Liquan Pei Original Estimate: 672h Remaining Estimate: 672h We would like to add a parallel implementation of word2vec to MLlib. word2vec finds distributed representations of words through training on large data sets. The Spark programming model fits nicely with word2vec, as the training algorithm of word2vec is embarrassingly parallel. We will focus on the skip-gram model and negative sampling in our initial implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
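A minimal sketch of the partition-local update idea from the comment above; trainPartition, averageVectors, corpus, and initialVectors are hypothetical stand-ins for illustration, not MLlib API:
{code}
// Sketch only: each partition runs stochastic updates locally, then the
// per-partition vectors are merged. Averaging (as below) is exactly the
// step the comment flags as unproven for word2vec.
def trainPartition(sentences: Iterator[Seq[String]], init: Array[Float]): Array[Float] = ???

def averageVectors(a: Array[Float], b: Array[Float]): Array[Float] =
  a.zip(b).map { case (x, y) => (x + y) / 2.0f }

val merged = corpus                                   // corpus: RDD[Seq[String]]
  .mapPartitions(it => Iterator(trainPartition(it, initialVectors)))
  .reduce(averageVectors)
{code}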
[jira] [Updated] (SPARK-2692) Decision Tree API update
[ https://issues.apache.org/jira/browse/SPARK-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2692: - Assignee: Joseph K. Bradley Decision Tree API update Key: SPARK-2692 URL: https://issues.apache.org/jira/browse/SPARK-2692 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Split Decision Tree API into separate Classifier and Regressor classes. Details: (a) Split classes: E.g.: DecisionTree -- DecisionTreeClassifier and DecisionTreeRegressor (b) Included print() function for human-readable model descriptions (c) Renamed Strategy to *Params. Changed to take strings instead of special types. (d) Made configuration classes (Impurity, QuantileStrategy) private to mllib. (e) Changed meaning of maxDepth by 1 to match scikit-learn and rpart. (f) Removed static train() functions in favor of using Params classes. (g) Introduced DatasetInfo class for metadata. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2692) Decision Tree API update
[ https://issues.apache.org/jira/browse/SPARK-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2692: - Target Version/s: 1.1.0 Affects Version/s: 1.0.0 Decision Tree API update Key: SPARK-2692 URL: https://issues.apache.org/jira/browse/SPARK-2692 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Split Decision Tree API into separate Classifier and Regressor classes. Details: (a) Split classes: E.g.: DecisionTree -- DecisionTreeClassifier and DecisionTreeRegressor (b) Included print() function for human-readable model descriptions (c) Renamed Strategy to *Params. Changed to take strings instead of special types. (d) Made configuration classes (Impurity, QuantileStrategy) private to mllib. (e) Changed meaning of maxDepth by 1 to match scikit-learn and rpart. (f) Removed static train() functions in favor of using Params classes. (g) Introduced DatasetInfo class for metadata. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2715) ExternalAppendOnlyMap adds max limit of times and max limit of disk bytes written for spilling
YanTang Zhai created SPARK-2715: --- Summary: ExternalAppendOnlyMap adds max limit of times and max limit of disk bytes written for spilling Key: SPARK-2715 URL: https://issues.apache.org/jira/browse/SPARK-2715 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai Priority: Minor ExternalAppendOnlyMap adds a max limit on the number of spills and on the disk bytes written when spilling. This way, a task with data skew can fail fast instead of running for a long time. -- This message was sent by Atlassian JIRA (v6.2#6252)
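A rough sketch of what such a guard could look like in the spill path; the configuration keys and class shape here are assumptions for illustration, not anything Spark actually defines:
{code}
import org.apache.spark.{SparkConf, SparkException}

// Sketch only: the config keys below are assumptions, not real Spark keys.
class SpillGuard(conf: SparkConf) {
  private var spillCount = 0
  private var diskBytesSpilled = 0L
  private val maxSpillTimes = conf.getInt("spark.shuffle.spill.maxTimes", 1000)
  private val maxSpillBytes = conf.getLong("spark.shuffle.spill.maxDiskBytes", 100L << 30)

  /** Called once per spill; fails the task fast when it spills too much. */
  def onSpill(bytesThisSpill: Long): Unit = {
    spillCount += 1
    diskBytesSpilled += bytesThisSpill
    if (spillCount > maxSpillTimes || diskBytesSpilled > maxSpillBytes) {
      throw new SparkException(s"Task spilled $spillCount times " +
        s"($diskBytesSpilled bytes to disk); the input partition is likely skewed")
    }
  }
}
{code}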
[jira] [Updated] (SPARK-2702) Upgrade Tachyon dependency to 0.5.0
[ https://issues.apache.org/jira/browse/SPARK-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyuan Li updated SPARK-2702: -- Assignee: Rong Gu Upgrade Tachyon dependency to 0.5.0 --- Key: SPARK-2702 URL: https://issues.apache.org/jira/browse/SPARK-2702 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Haoyuan Li Assignee: Rong Gu Fix For: 1.1.0 Upgrade Tachyon dependency to 0.5.0: a. Code dependency. b. Start Tachyon script. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2703) Make Tachyon related unit tests execute without deploying a Tachyon system locally.
[ https://issues.apache.org/jira/browse/SPARK-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyuan Li updated SPARK-2703: -- Assignee: Rong Gu Make Tachyon related unit tests execute without deploying a Tachyon system locally. --- Key: SPARK-2703 URL: https://issues.apache.org/jira/browse/SPARK-2703 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Haoyuan Li Assignee: Rong Gu Fix For: 1.1.0 Use the LocalTachyonCluster class in tachyon-test.jar in 0.5.0 release. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian packages created with mvn ... -Pdeb (using assembly/pom.xml)
[ https://issues.apache.org/jira/browse/SPARK-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075998#comment-14075998 ] Christian Tzolov commented on SPARK-2614: - The #1611 pull request addresses some of the concerns expressed above. It doesn't put everything into a single package. Instead, when -Pdeb is enabled, two Debian packages are built: 1. spark_XXX_all.deb - the current Spark Debian package, without modifications. 2. spark_XXX_examples.deb - an additional deb package that bundles only spark_examples.jar Add the spark-examples-xxx-.jar to the Debian packages created with mvn ... -Pdeb (using assembly/pom.xml) -- Key: SPARK-2614 URL: https://issues.apache.org/jira/browse/SPARK-2614 Project: Spark Issue Type: Improvement Components: Build, Deploy Reporter: Christian Tzolov The tar.gz distribution already includes the spark-examples.jar in the bundle. It is common practice for installers to run SparkPi as a smoke test to verify that the installation is OK: /usr/share/spark/bin/spark-submit \ --num-executors 10 --master yarn-cluster \ --class org.apache.spark.examples.SparkPi \ /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2677) BasicBlockFetchIterator#next can wait forever
[ https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076033#comment-14076033 ] Guoqiang Li commented on SPARK-2677: [~pwendell], [~sarutak] How about the following solution? https://github.com/witgo/spark/compare/SPARK-2677 BasicBlockFetchIterator#next can wait forever - Key: SPARK-2677 URL: https://issues.apache.org/jira/browse/SPARK-2677 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.2, 1.0.0, 1.0.1 Reporter: Kousuke Saruta Priority: Blocker In BasicBlockFetchIterator#next, it waits for the fetch result on results.take().
{code}
override def next(): (BlockId, Option[Iterator[Any]]) = {
  resultsGotten += 1
  val startFetchWait = System.currentTimeMillis()
  val result = results.take()
  val stopFetchWait = System.currentTimeMillis()
  _fetchWaitTime += (stopFetchWait - startFetchWait)
  if (!result.failed) bytesInFlight -= result.size
  while (!fetchRequests.isEmpty &&
      (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
    sendRequest(fetchRequests.dequeue())
  }
  (result.blockId, if (result.failed) None else Some(result.deserialize()))
}
{code}
But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs up, the fetching executor waits forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2511) Add TF-IDF featurizer
[ https://issues.apache.org/jira/browse/SPARK-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076035#comment-14076035 ] duanfa edited comment on SPARK-2511 at 7/28/14 9:05 AM: i need it also was (Author: duanfa): i need it alse Add TF-IDF featurizer - Key: SPARK-2511 URL: https://issues.apache.org/jira/browse/SPARK-2511 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Port the TF-IDF implementation that was used in the Databricks Cloud demo to MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2511) Add TF-IDF featurizer
[ https://issues.apache.org/jira/browse/SPARK-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076035#comment-14076035 ] duanfa commented on SPARK-2511: --- i need it alse Add TF-IDF featurizer - Key: SPARK-2511 URL: https://issues.apache.org/jira/browse/SPARK-2511 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Port the TF-IDF implementation that was used in the Databricks Cloud demo to MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2511) Add TF-IDF featurizer
[ https://issues.apache.org/jira/browse/SPARK-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076035#comment-14076035 ] duanfa edited comment on SPARK-2511 at 7/28/14 9:12 AM: i need it also,i code tonight , was (Author: duanfa): i need it also Add TF-IDF featurizer - Key: SPARK-2511 URL: https://issues.apache.org/jira/browse/SPARK-2511 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Port the TF-IDF implementation that was used in the Databricks Cloud demo to MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file
[ https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076086#comment-14076086 ] Teng Qiu commented on SPARK-2576: - I get the same problem on 1.0.1, standalone cluster. slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file -- Key: SPARK-2576 URL: https://issues.apache.org/jira/browse/SPARK-2576 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.0.1 Environment: One Mesos 0.19 master without zookeeper and 4 mesos slaves. JDK 1.7.51 and Scala 2.10.4 on all nodes. HDFS from CDH5.0.3 Spark version: I tried both the pre-built CDH5 spark package available from http://spark.apache.org/downloads.html and packaging spark with sbt 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here http://mesosphere.io/learn/run-spark-on-mesos/ All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have Reporter: Svend Vanderveken Assignee: Yin Huai Priority: Blocker Fix For: 1.0.2 Execution of a SQL query against HDFS systematically throws a class not found exception on slave nodes. (this was originally reported on the user list: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html) Sample code (run from spark-shell):
{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Car(timestamp: Long, objectid: String, isGreen: Boolean)

// I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata
val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0")
val cars = data.map(_.split(",")).map(ar => Car(ar(0).toLong, ar(1), ar(2).toBoolean))
cars.registerAsTable("mcars")

val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = true")
allgreens.collect.take(10).foreach(println)
{code}
Stack trace on the slave nodes:
{code}
I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0
I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 20140714-142853-485682442-5050-25487-2
14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20140714-142853-485682442-5050-25487-2
14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: mesos,mnubohadoop
14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mesos, mnubohadoop)
14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started
14/07/16 13:01:17 INFO Remoting: Starting remoting
14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@vm23:38230]
14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@vm23:38230]
14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: akka.tcp://spark@vm28:41632/user/MapOutputTracker
14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: akka.tcp://spark@vm28:41632/user/BlockManagerMaster
14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140716130117-8ea0
14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 MB.
14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501)
14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager
14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager
14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633
14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server
14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28:33973
14/07/16 13:01:18 INFO Executor: Running task ID 2
14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0
14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with curMem=0, maxMem=309225062
14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 122.6 KB, free 294.8 MB)
14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 0.294602722 s
14/07/16 13:01:19 INFO HadoopRDD: Input split: hdfs://vm28:8020/test/cardata/part-0:23960450+23960451
I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown
14/07/16 13:01:20 ERROR Executor: Exception in task ID 2
java.lang.NoClassDefFoundError: $line11/$read$
	at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
	at
{code}
[jira] [Commented] (SPARK-2417) Decision tree tests are failing
[ https://issues.apache.org/jira/browse/SPARK-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076101#comment-14076101 ] Patrick Morton commented on SPARK-2417: --- Hallucinogenic stroke of important metabolites during black father may affect the awkward nudity and stress of the midline, resulting in includesubtypes in the clinical fingers that control belief and execution. adderall depression http://www.surveyanalytics.com//userimages/sub-2/2007589/3153260/29851520/7787433-29851520-stopadd9.html For language, it would be inconclusive to choose routinely proprietary cultures to be imprinted with a far fatal history. Decision tree tests are failing --- Key: SPARK-2417 URL: https://issues.apache.org/jira/browse/SPARK-2417 Project: Spark Issue Type: Bug Components: MLlib Reporter: Patrick Wendell Assignee: Jon Sondag Fix For: 1.0.1, 1.1.0 After SPARK-2152 was merged, these tests started failing in Jenkins: {code} - classification stump with all categorical variables *** FAILED *** org.scalatest.exceptions.TestFailedException was thrown. (DecisionTreeSuite.scala:257) - regression stump with all categorical variables *** FAILED *** org.scalatest.exceptions.TestFailedException was thrown. (DecisionTreeSuite.scala:284) {code} https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/97/hadoop.version=1.0.4,label=centos/console -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2415) RowWriteSupport should handle empty ArrayType correctly.
[ https://issues.apache.org/jira/browse/SPARK-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076104#comment-14076104 ] Patrick Morton commented on SPARK-2415: --- In the ethical, three endings of core symptoms have been not linked with these benzodiazepines. http://www.surveyanalytics.com//userimages/sub-2/2007589/3153260/29851519/7787449-29851519-stopadd32.html Before, this firing of gym will suffer from sampling trouble because rights with thought conditions will be more other to be referred to fever structures if they are experiencing pleasurable times. RowWriteSupport should handle empty ArrayType correctly. Key: SPARK-2415 URL: https://issues.apache.org/jira/browse/SPARK-2415 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0, 1.0.2 {{RowWriteSupport}} doesn't write empty {{ArrayType}} value, so the read value becomes {{null}}. It should write empty {{ArrayType}} value as it is. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2714) DAGScheduler logs jobid when runJob finishes
[ https://issues.apache.org/jira/browse/SPARK-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076232#comment-14076232 ] Apache Spark commented on SPARK-2714: - User 'YanTangZhai' has created a pull request for this issue: https://github.com/apache/spark/pull/1617 DAGScheduler logs jobid when runJob finishes Key: SPARK-2714 URL: https://issues.apache.org/jira/browse/SPARK-2714 Project: Spark Issue Type: Documentation Components: Spark Core Reporter: YanTang Zhai Priority: Minor DAGScheduler logs jobid when runJob finishes -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-2415) RowWriteSupport should handle empty ArrayType correctly.
[ https://issues.apache.org/jira/browse/SPARK-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Farrell updated SPARK-2415: Comment: was deleted (was: In the ethical, three endings of core symptoms have been not linked with these benzodiazepines. http://www.surveyanalytics.com//userimages/sub-2/2007589/3153260/29851519/7787449-29851519-stopadd32.html Before, this firing of gym will suffer from sampling trouble because rights with thought conditions will be more other to be referred to fever structures if they are experiencing pleasurable times.) RowWriteSupport should handle empty ArrayType correctly. Key: SPARK-2415 URL: https://issues.apache.org/jira/browse/SPARK-2415 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0, 1.0.2 {{RowWriteSupport}} doesn't write empty {{ArrayType}} value, so the read value becomes {{null}}. It should write empty {{ArrayType}} value as it is. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-2417) Decision tree tests are failing
[ https://issues.apache.org/jira/browse/SPARK-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Farrell updated SPARK-2417: Comment: was deleted (was: Hallucinogenic stroke of important metabolites during black father may affect the awkward nudity and stress of the midline, resulting in includesubtypes in the clinical fingers that control belief and execution. adderall depression http://www.surveyanalytics.com//userimages/sub-2/2007589/3153260/29851520/7787433-29851520-stopadd9.html For language, it would be inconclusive to choose routinely proprietary cultures to be imprinted with a far fatal history.) Decision tree tests are failing --- Key: SPARK-2417 URL: https://issues.apache.org/jira/browse/SPARK-2417 Project: Spark Issue Type: Bug Components: MLlib Reporter: Patrick Wendell Assignee: Jon Sondag Fix For: 1.0.1, 1.1.0 After SPARK-2152 was merged, these tests started failing in Jenkins: {code} - classification stump with all categorical variables *** FAILED *** org.scalatest.exceptions.TestFailedException was thrown. (DecisionTreeSuite.scala:257) - regression stump with all categorical variables *** FAILED *** org.scalatest.exceptions.TestFailedException was thrown. (DecisionTreeSuite.scala:284) {code} https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/97/hadoop.version=1.0.4,label=centos/console -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2715) ExternalAppendOnlyMap adds max limit of times and max limit of disk bytes written for spilling
[ https://issues.apache.org/jira/browse/SPARK-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076257#comment-14076257 ] Apache Spark commented on SPARK-2715: - User 'YanTangZhai' has created a pull request for this issue: https://github.com/apache/spark/pull/1618 ExternalAppendOnlyMap adds max limit of times and max limit of disk bytes written for spilling -- Key: SPARK-2715 URL: https://issues.apache.org/jira/browse/SPARK-2715 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai Priority: Minor ExternalAppendOnlyMap adds a max limit on the number of spills and on the disk bytes written when spilling. This way, a task with data skew can fail fast instead of running for a long time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2141) Add sc.getPersistentRDDs() to PySpark
[ https://issues.apache.org/jira/browse/SPARK-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076262#comment-14076262 ] Kan Zhang commented on SPARK-2141: -- Hi [~nchammas], we are debating potential use cases for this feature. Would be great if you could provide your input (use above link). Thx. Add sc.getPersistentRDDs() to PySpark - Key: SPARK-2141 URL: https://issues.apache.org/jira/browse/SPARK-2141 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 1.0.0 Reporter: Nicholas Chammas Assignee: Kan Zhang PySpark does not appear to have {{sc.getPersistentRDDs()}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2677) BasicBlockFetchIterator#next can wait forever
[ https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076278#comment-14076278 ] Apache Spark commented on SPARK-2677: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/1619 BasicBlockFetchIterator#next can wait forever - Key: SPARK-2677 URL: https://issues.apache.org/jira/browse/SPARK-2677 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.2, 1.0.0, 1.0.1 Reporter: Kousuke Saruta Priority: Blocker In BasicBlockFetchIterator#next, it waits for the fetch result on results.take().
{code}
override def next(): (BlockId, Option[Iterator[Any]]) = {
  resultsGotten += 1
  val startFetchWait = System.currentTimeMillis()
  val result = results.take()
  val stopFetchWait = System.currentTimeMillis()
  _fetchWaitTime += (stopFetchWait - startFetchWait)
  if (!result.failed) bytesInFlight -= result.size
  while (!fetchRequests.isEmpty &&
      (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
    sendRequest(fetchRequests.dequeue())
  }
  (result.blockId, if (result.failed) None else Some(result.deserialize()))
}
{code}
But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs up, the fetching executor waits forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2677) BasicBlockFetchIterator#next can wait forever
[ https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076033#comment-14076033 ] Guoqiang Li edited comment on SPARK-2677 at 7/28/14 3:00 PM: - [~pwendell], [~sarutak] How about the following solution? https://github.com/apache/spark/pull/1619 was (Author: gq): [~pwendell], [~sarutak] How about the following solution? https://github.com/witgo/spark/compare/SPARK-2677 BasicBlockFetchIterator#next can wait forever - Key: SPARK-2677 URL: https://issues.apache.org/jira/browse/SPARK-2677 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.2, 1.0.0, 1.0.1 Reporter: Kousuke Saruta Priority: Blocker In BasicBlockFetchIterator#next, it waits for the fetch result on results.take().
{code}
override def next(): (BlockId, Option[Iterator[Any]]) = {
  resultsGotten += 1
  val startFetchWait = System.currentTimeMillis()
  val result = results.take()
  val stopFetchWait = System.currentTimeMillis()
  _fetchWaitTime += (stopFetchWait - startFetchWait)
  if (!result.failed) bytesInFlight -= result.size
  while (!fetchRequests.isEmpty &&
      (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
    sendRequest(fetchRequests.dequeue())
  }
  (result.blockId, if (result.failed) None else Some(result.deserialize()))
}
{code}
But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs up, the fetching executor waits forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running applications
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076403#comment-14076403 ] Aaron Davidson commented on SPARK-1860: --- There's not an easy way to tell if an application is still running. However, the Worker has state about which executors are still running. This is really what I intended originally -- we must not clean up an Executor's own state from underneath it. I will change the title to reflect this intention. Standalone Worker cleanup should not clean up running applications -- Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Critical Fix For: 1.1.0 The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any applications that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Applications should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-1860: -- Description: The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any executors that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Executor's log/data folders should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. was: The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any applications that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Applications should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. Standalone Worker cleanup should not clean up running executors --- Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Critical Fix For: 1.1.0 The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any executors that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Executor's log/data folders should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
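A hedged sketch of the guard the updated description calls for; the work-dir layout and helper names are simplified assumptions, not the Worker's actual code:
{code}
import java.io.File

// Sketch only: deleteRecursively stands in for Spark's Utils.deleteRecursively.
def cleanupWorkDir(workDir: File, activeAppIds: Set[String], retentionMs: Long): Unit = {
  for (dir <- workDir.listFiles() if dir.isDirectory) {
    val expired = System.currentTimeMillis() - dir.lastModified() > retentionMs
    // Never delete a directory that still backs a running executor.
    if (expired && !activeAppIds.contains(dir.getName)) {
      deleteRecursively(dir)
    }
  }
}

def deleteRecursively(f: File): Unit = {
  if (f.isDirectory) f.listFiles().foreach(deleteRecursively)
  f.delete()
}
{code}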
[jira] [Created] (SPARK-2716) Having clause with no references fails to resolve
Michael Armbrust created SPARK-2716: --- Summary: Having clause with no references fails to resolve Key: SPARK-2716 URL: https://issues.apache.org/jira/browse/SPARK-2716 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical For example: {code} SELECT a FROM b GROUP BY a HAVING COUNT(*) > 1 {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2563) Re-open sockets to handle connect timeouts
[ https://issues.apache.org/jira/browse/SPARK-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-2563: - Description: In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. If the connection attempt times out, the socket gets closed and from [1] we get a ClosedChannelException. We should check if the Socket was closed due to a timeout and open a new socket and try to connect. FWIW, I was able to work around my problems by increasing the number of SYN retries in Linux. (I ran echo 8 > /proc/sys/net/ipv4/tcp_syn_retries) [1] http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573 was:In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. We should make the number of retries before failing configurable to handle these cases. Summary: Re-open sockets to handle connect timeouts (was: Make number of connection retries configurable) Re-open sockets to handle connect timeouts -- Key: SPARK-2563 URL: https://issues.apache.org/jira/browse/SPARK-2563 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shivaram Venkataraman Priority: Minor In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. If the connection attempt times out, the socket gets closed and from [1] we get a ClosedChannelException. We should check if the Socket was closed due to a timeout and open a new socket and try to connect. FWIW, I was able to work around my problems by increasing the number of SYN retries in Linux. (I ran echo 8 > /proc/sys/net/ipv4/tcp_syn_retries) [1] http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573 -- This message was sent by Atlassian JIRA (v6.2#6252)
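A sketch of the retry idea under the stated assumptions (connectWithRetries is illustrative, not Spark API): if the connect attempt fails with a timeout and the channel is closed, open a fresh SocketChannel and try again.
{code}
import java.net.{InetSocketAddress, SocketTimeoutException}
import java.nio.channels.{ClosedChannelException, SocketChannel}

// Sketch only: re-open the socket on connect timeout instead of giving up.
def connectWithRetries(address: InetSocketAddress, maxRetries: Int): SocketChannel = {
  var attempt = 0
  while (true) {
    val channel = SocketChannel.open()
    try {
      channel.socket().connect(address, 60000) // 60s connect timeout
      return channel
    } catch {
      case e @ (_: SocketTimeoutException | _: ClosedChannelException) =>
        channel.close()
        attempt += 1
        if (attempt > maxRetries) throw e // out of retries, propagate
    }
  }
  throw new IllegalStateException("unreachable")
}
{code}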
[jira] [Comment Edited] (SPARK-2563) Re-open sockets to handle connect timeouts
[ https://issues.apache.org/jira/browse/SPARK-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065735#comment-14065735 ] Shivaram Venkataraman edited comment on SPARK-2563 at 7/28/14 5:43 PM: --- More details about the bug are in -https://github.com/apache/spark/pull/1471- was (Author: shivaram): https://github.com/apache/spark/pull/1471 Re-open sockets to handle connect timeouts -- Key: SPARK-2563 URL: https://issues.apache.org/jira/browse/SPARK-2563 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shivaram Venkataraman Priority: Minor In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. If the connection attempt times out, the socket gets closed and from [1] we get a ClosedChannelException. We should check if the Socket was closed due to a timeout and open a new socket and try to connect. FWIW, I was able to work around my problems by increasing the number of SYN retries in Linux. (I ran echo 8 > /proc/sys/net/ipv4/tcp_syn_retries) [1] http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2410) Thrift/JDBC Server
[ https://issues.apache.org/jira/browse/SPARK-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076453#comment-14076453 ] Apache Spark commented on SPARK-2410: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/1620 Thrift/JDBC Server -- Key: SPARK-2410 URL: https://issues.apache.org/jira/browse/SPARK-2410 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.1.0 We have this, but need to make sure that it gets merged into master before the 1.1 release. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076468#comment-14076468 ] Sean Owen commented on SPARK-2420: -- I'm sure shading just means moving the packages, and references in the byte code, with maven-shade-plugin. assembly takes very little of the total build time. Nothing else I can see except Hadoop has a Guava dependency. But yeah, there is gonna have to be a teensy fork of a Guava class maintained then. It can go in the source tree, so doesn't necessarily need more assembly surgery. Does it change your calculus? I remain slightly grossed out by all options. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076456#comment-14076456 ] Marcelo Vanzin commented on SPARK-2420: --- So let me see if I'm following things so far. The current proposals are 1. downgrade or 2. shade (which if I understand Patrick correctly means forking Guava and changing the sources to a different package, not using the maven shade plugin?). Both options avoid overriding libraries used by Hadoop; the first by using the same one, the second by avoiding the namespace conflict. Option 1 provides fewer backwards-compatibility issues. Shading just removes Guava from the user's classpath, so it leaves users to manage it; they'll either inherit it from Hadoop, or get into a situation where they override the classpath's Guava with their own, and potentially might break Hadoop. For both cases, I think the best recommendation is to tell the user to shade Guava in their application if they really need a newer version - that way they won't be overriding the library used by Hadoop classes. Option 1 is also less work; you don't need to maintain the shaded Guava (if I understand correctly what was meant here by shading). Using maven's shade instead means builds would get slower. Also, does anyone have an idea about whether any of the libraries Spark depends on depend on Guava and need a version later than 11? I haven't checked that. As for Guava leaking through Spark's API, that's very, very unfortunate. Option 2 here will definitely break compatibility for anyone who uses those APIs. Option 1, on the other hand, has only a couple of implications: according to Guava's javadoc, only one method doesn't exist in 11 ({{transform}}) and one has a changed signature ({{presentInstances}}, and only generic arguments were changed, so maybe still binary compatible). So, pending my dependency question above, I still think that downgrading is the option that creates fewer headaches. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076488#comment-14076488 ] Marcelo Vanzin commented on SPARK-2420: --- Forking {{Optional}} would make Option 2 more palatable. But shading + fork that class still feels more like a sledgehammer, and it will have pretty much the same effect on user code as downgrading, from what I can see (since now, without explicit dependencies, they'll be getting Guava 11 from Hadoop instead of Guava 14 from Spark). Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2523) For partitioned Hive tables, partition-specific ObjectInspectors should be used.
[ https://issues.apache.org/jira/browse/SPARK-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2523. - Resolution: Fixed Fix Version/s: 1.1.0 For partitioned Hive tables, partition-specific ObjectInspectors should be used. Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.1.0 In HiveTableScan.scala, ObjectInspector was created for all of the partition based records, which probably causes ClassCastException if the object inspector is not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2479) Comparing floating-point numbers using relative error in UnitTests
[ https://issues.apache.org/jira/browse/SPARK-2479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2479. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1425 [https://github.com/apache/spark/pull/1425] Comparing floating-point numbers using relative error in UnitTests -- Key: SPARK-2479 URL: https://issues.apache.org/jira/browse/SPARK-2479 Project: Spark Issue Type: Improvement Reporter: DB Tsai Assignee: DB Tsai Fix For: 1.1.0 Floating point math is not exact, and most floating-point numbers end up being slightly imprecise due to rounding errors. Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations or the precision of intermediates can change the result. That means that comparing two floats to see if they are equal is usually not what we want. As long as this imprecision stays small, it can usually be ignored. See the following famous article for detail. http://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/ For example:
{code}
float a = 0.15 + 0.15
float b = 0.1 + 0.2
if (a == b) // can be false!
if (a >= b) // can also be false!
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
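For reference, a minimal relative-error comparison of the kind the resolved change adopts (the helper name is an assumption):
{code}
// Sketch: compare doubles by relative error rather than exact equality.
def approxEqual(a: Double, b: Double, eps: Double = 1e-8): Boolean = {
  if (a == b) true // handles exact matches, including 0.0
  else math.abs(a - b) <= eps * math.max(math.abs(a), math.abs(b))
}
{code}
With a = 0.15 + 0.15 and b = 0.1 + 0.2 as doubles, a == b is false, but approxEqual(a, b) holds because the relative difference is tiny.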
[jira] [Updated] (SPARK-2544) Improve ALS algorithm resource usage
[ https://issues.apache.org/jira/browse/SPARK-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2544: - Target Version/s: 1.1.0 Improve ALS algorithm resource usage Key: SPARK-2544 URL: https://issues.apache.org/jira/browse/SPARK-2544 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Guoqiang Li ALS has the following problems: 1. The dependency chains of the products and users RDDs are too long. 2. The shuffle files are too large. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2544) Improve ALS algorithm resource usage
[ https://issues.apache.org/jira/browse/SPARK-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2544: - Assignee: Guoqiang Li Improve ALS algorithm resource usage Key: SPARK-2544 URL: https://issues.apache.org/jira/browse/SPARK-2544 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Guoqiang Li Assignee: Guoqiang Li ALS has the following problems: 1. The dependency chains of the products and users RDDs are too long. 2. The shuffle files are too large. -- This message was sent by Atlassian JIRA (v6.2#6252)
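One standard remedy for problem 1 is to truncate the growing lineage periodically; a sketch under assumed names (the update functions, interval, and checkpoint directory are illustrative, not ALS's actual implementation):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch only: checkpointing every few iterations cuts the users/products
// RDD dependency chains so lineage cannot grow without bound.
def runAls[T](sc: SparkContext, initUsers: RDD[T], initProducts: RDD[T],
              updateUsers: RDD[T] => RDD[T], updateProducts: RDD[T] => RDD[T],
              numIterations: Int): (RDD[T], RDD[T]) = {
  sc.setCheckpointDir("hdfs:///tmp/als-checkpoints") // illustrative path
  var users = initUsers
  var products = initProducts
  for (iter <- 1 to numIterations) {
    users = updateUsers(products)
    products = updateProducts(users)
    if (iter % 5 == 0) {
      users.checkpoint()
      products.checkpoint()
    }
  }
  (users, products)
}
{code}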
[jira] [Resolved] (SPARK-2410) Thrift/JDBC Server
[ https://issues.apache.org/jira/browse/SPARK-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2410. - Resolution: Fixed Thrift/JDBC Server -- Key: SPARK-2410 URL: https://issues.apache.org/jira/browse/SPARK-2410 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.1.0 We have this, but need to make sure that it gets merged into master before the 1.1 release. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076778#comment-14076778 ] Mark Hamstra commented on SPARK-1860: - I don't think that there is much in the way of conflict, but something to be aware of is that the proposed fix to SPARK-2425 does modify Executor state transitions and cleanup: https://github.com/apache/spark/pull/1360 Standalone Worker cleanup should not clean up running executors --- Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Critical Fix For: 1.1.0 The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any executors that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Executor's log/data folders should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2305) pyspark - depend on py4j > 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2305: -- Target Version/s: 1.1.0 Assignee: Josh Rosen Py4J 0.8.2.1 was just released; I'll look into upgrading. pyspark - depend on py4j > 0.8.1 Key: SPARK-2305 URL: https://issues.apache.org/jira/browse/SPARK-2305 Project: Spark Issue Type: Dependency upgrade Components: PySpark Affects Versions: 1.0.0 Reporter: Matthew Farrellee Assignee: Josh Rosen Priority: Minor py4j 0.8.1 has a bug in java_import that results in extraneous warnings. pyspark should depend on a py4j version > 0.8.1 (none exists at time of filing) that includes https://github.com/bartdag/py4j/commit/64cd657e75dbe769c5e3bf757fcf83b5c0f8f4f0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2411) Standalone Master - direct users to turn on event logs
[ https://issues.apache.org/jira/browse/SPARK-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-2411. -- Resolution: Fixed Standalone Master - direct users to turn on event logs -- Key: SPARK-2411 URL: https://issues.apache.org/jira/browse/SPARK-2411 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.1 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.1.0 Attachments: Application history load error.png, Application history not found.png, Event logging not enabled.png Right now, if the user attempts to click on a finished application's UI, the page simply refreshes. This is because the event logs are not there, in which case we set the href="". We could provide more information by pointing the user to configure spark.eventLog.enabled if they click on the empty link. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values
[ https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076739#comment-14076739 ] Yin Huai edited comment on SPARK-1649 at 7/28/14 8:42 PM: -- Hive seems to support null values in a Map; to be consistent with Hive, we will also support that. I will introduce a boolean valueContainsNull to MapType. For null map keys, Hive has inconsistent behaviors. Here are examples (using sbt/sbt hive/console).
{code}
runSqlHive("select map(null, 1, null, 2, null, 3, 4, null, 5, null) from src limit 1")
res6: Seq[String] = Buffer({"4":null,"5":null})
runSqlHive("select map_keys(map(null, 1, null, 2, null, 3, 4, null, 5, null)) from src limit 1")
res7: Seq[String] = Buffer([null,4,5])
runSqlHive("select map_values(map(null, 1, null, 2, null, 3, 4, null, 5, null)) from src limit 1")
res8: Seq[String] = Buffer([3,null,null])
{code}
Also, different implementations handle null keys in different ways (e.g. HashMap supports an entry with a null key, but TreeMap will throw an NPE when a user wants to insert an entry with a null key). So, I think we will not allow null keys in a map. was (Author: yhuai): Hive seems to support null values in a Map; to be consistent with Hive, we will also support that. I will introduce a boolean valuesContainNull to MapType. For null map keys, Hive has inconsistent behaviors. Here are examples (using sbt/sbt hive/console).
{code}
runSqlHive("select map(null, 1, null, 2, null, 3, 4, null, 5, null) from src limit 1")
res6: Seq[String] = Buffer({"4":null,"5":null})
runSqlHive("select map_keys(map(null, 1, null, 2, null, 3, 4, null, 5, null)) from src limit 1")
res7: Seq[String] = Buffer([null,4,5])
runSqlHive("select map_values(map(null, 1, null, 2, null, 3, 4, null, 5, null)) from src limit 1")
res8: Seq[String] = Buffer([3,null,null])
{code}
Also, different implementations handle null keys in different ways (e.g. HashMap supports an entry with a null key, but TreeMap will throw an NPE when a user wants to insert an entry with a null key). So, I think we will not allow null keys in a map. Figure out Nullability semantics for Array elements and Map values -- Key: SPARK-1649 URL: https://issues.apache.org/jira/browse/SPARK-1649 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Andre Schumacher Priority: Critical For the underlying storage layer it would simplify things such as schema conversions, predicate filter determination and such to record in the data type itself whether a column can be nullable. So the DataType type could look like this: abstract class DataType(nullable: Boolean = true) Concrete subclasses could then override the nullable val. Mostly this could be left as the default, but when types can be contained in nested types one could optimize for, e.g., arrays with elements that are nullable and those that are not. -- This message was sent by Atlassian JIRA (v6.2#6252)
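The proposed shape of the change, sketched from the comment (constructor details are assumptions):
{code}
// Sketch of the proposed types: arrays record whether elements may be null,
// maps record whether values may be null, and null map keys are disallowed.
abstract class DataType

case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

case class MapType(
    keyType: DataType,
    valueType: DataType,
    valueContainsNull: Boolean) extends DataType
{code}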
[jira] [Commented] (SPARK-1687) Support NamedTuples in RDDs
[ https://issues.apache.org/jira/browse/SPARK-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076849#comment-14076849 ] Davies Liu commented on SPARK-1687: --- Dill is implemented in pure Python, so it will have performance similar to pickle, but much slower than cPickle, which we use as the default serializer. So we could not switch the default serializer to Dill. We could provide a customized namedtuple (which can be serialized by cPickle) and also replace the one in collections with it. I will send a PR, if it makes sense. Support NamedTuples in RDDs --- Key: SPARK-1687 URL: https://issues.apache.org/jira/browse/SPARK-1687 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0 Environment: Spark version 1.0.0-SNAPSHOT Python 2.7.5 Reporter: Pat McDonough Assignee: Kan Zhang Add Support for NamedTuples in RDDs. Some sample code is below, followed by the current error that comes from it. Based on a quick conversation with [~ahirreddy], [Dill|https://github.com/uqfoundation/dill] might be a good solution here.
{code}
In [26]: from collections import namedtuple
...
In [33]: Person = namedtuple('Person', 'id firstName lastName')

In [34]: jon = Person(1, "Jon", "Doe")

In [35]: jane = Person(2, "Jane", "Doe")

In [36]: theDoes = sc.parallelize((jon, jane))

In [37]: theDoes.collect()
Out[37]: [Person(id=1, firstName='Jon', lastName='Doe'), Person(id=2, firstName='Jane', lastName='Doe')]

In [38]: theDoes.count()
PySpark worker failed with exception:
PySpark worker failed with exception:
Traceback (most recent call last):
  File "/Users/pat/Projects/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 283, in func
    def func(s, iterator): return f(iterator)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in <lambda>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in <genexpr>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 129, in load_stream
    yield self._read_with_length(stream)
  File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 146, in _read_with_length
    return self.loads(obj)
AttributeError: 'module' object has no attribute 'Person'
Traceback (most recent call last):
  File "/Users/pat/Projects/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 283, in func
    def func(s, iterator): return f(iterator)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in <lambda>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in <genexpr>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 129, in load_stream
    yield self._read_with_length(stream)
  File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 146, in _read_with_length
    return self.loads(obj)
AttributeError: 'module' object has no attribute 'Person'
14/04/30 14:43:53 ERROR Executor: Exception in task ID 23
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/pat/Projects/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 283, in func
    def func(s, iterator): return f(iterator)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in <lambda>
    return self.mapPartitions(lambda i: [sum(1 for _ in
{code}
[jira] [Commented] (SPARK-2655) Change the default logging level to WARN
[ https://issues.apache.org/jira/browse/SPARK-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076863#comment-14076863 ] Davies Liu commented on SPARK-2655: --- [~pwendell] [~matei], what do you think about this? Change the default logging level to WARN Key: SPARK-2655 URL: https://issues.apache.org/jira/browse/SPARK-2655 Project: Spark Issue Type: Improvement Reporter: Davies Liu The current logging level INFO is pretty noisy; reducing this unnecessary logging will provide a better experience for users. Spark is much more stable and mature than before, so users will not need that much logging in normal cases. But some high-level information will be helpful, such as messages about job and task progress. We could change this important logging to WARN level as a hack; otherwise we will need to change all other logging to DEBUG level. PS: it would be better to have one-line progress logging in the terminal (also in the title). -- This message was sent by Atlassian JIRA (v6.2#6252)
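Until any default change lands, users can already cut the noise themselves through log4j; a minimal example using the log4j 1.x API that Spark ships with:
{code}
import org.apache.log4j.{Level, Logger}

// Raise the root logger to WARN from application code. The equivalent
// conf/log4j.properties line is: log4j.rootCategory=WARN, console
Logger.getRootLogger.setLevel(Level.WARN)
{code}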
[jira] [Updated] (SPARK-2717) BasicBlockFetchIterator#next should log when it gets stuck
[ https://issues.apache.org/jira/browse/SPARK-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2717: --- Component/s: Spark Core BasicBlockFetchIterator#next should log when it gets stuck -- Key: SPARK-2717 URL: https://issues.apache.org/jira/browse/SPARK-2717 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Priority: Blocker If this is stuck for a long time waiting for blocks, we should log what nodes it is waiting for to help debugging. One way to do this is to call take() with a timeout (e.g. 60 seconds) and when the timeout expires log a message for the blocks it is still waiting for. This could all happen in a loop so that the wait just restarts after the message is logged. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2717) BasicBlockFetchIterator#next should log when it gets stuck
Patrick Wendell created SPARK-2717: -- Summary: BasicBlockFetchIterator#next should log when it gets stuck Key: SPARK-2717 URL: https://issues.apache.org/jira/browse/SPARK-2717 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Blocker If this is stuck for a long time waiting for blocks, we should log what nodes it is waiting for to help debugging. One way to do this is to call take() with a timeout (e.g. 60 seconds) and when the timeout expires log a message for the blocks it is still waiting for. This could all happen in a loop so that the wait just restarts after the message is logged. -- This message was sent by Atlassian JIRA (v6.2#6252)
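The loop being proposed is simple; here is a rough Python sketch of the pattern, with a plain queue standing in for the iterator's internal results queue (names illustrative, not the actual BasicBlockFetchIterator internals):
{code:python}
import logging
import Queue  # 'queue' on Python 3

def next_result(results, outstanding_blocks, timeout_secs=60):
    # Poll with a timeout instead of blocking forever; on each timeout,
    # log which blocks are still outstanding, then resume waiting.
    while True:
        try:
            return results.get(timeout=timeout_secs)
        except Queue.Empty:
            logging.warning("Still waiting for blocks %s after %d seconds",
                            sorted(outstanding_blocks), timeout_secs)
{code}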
[jira] [Updated] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes
[ https://issues.apache.org/jira/browse/SPARK-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2718: - Affects Version/s: (was: 1.0.1) 1.0.2 YARN does not handle spark configs with quotes or backslashes - Key: SPARK-2718 URL: https://issues.apache.org/jira/browse/SPARK-2718 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Andrew Or Fix For: 1.1.0 Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes
Andrew Or created SPARK-2718: Summary: YARN does not handle spark configs with quotes or backslashes Key: SPARK-2718 URL: https://issues.apache.org/jira/browse/SPARK-2718 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.1 Reporter: Andrew Or Fix For: 1.1.0 Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes
[ https://issues.apache.org/jira/browse/SPARK-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2718: - Description: Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} was: Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext. YARN does not handle spark configs with quotes or backslashes - Key: SPARK-2718 URL: https://issues.apache.org/jira/browse/SPARK-2718 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Andrew Or Fix For: 1.1.0 Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. 
As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes
[ https://issues.apache.org/jira/browse/SPARK-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2718: - Description: Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} was: Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} YARN does not handle spark configs with quotes or backslashes - Key: SPARK-2718 URL: https://issues.apache.org/jira/browse/SPARK-2718 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Andrew Or Fix For: 1.1.0 Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. 
As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at
[jira] [Commented] (SPARK-1343) PySpark OOMs without caching
[ https://issues.apache.org/jira/browse/SPARK-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077011#comment-14077011 ] Davies Liu commented on SPARK-1343: --- Maybe it's related to partitionBy() with a small number of partitions: the data in one partition is sent to the JVM as several huge bytearrays, which consume huge amounts of memory before being written to disk, because the default spark.serializer.objectStreamReset is too large. Hopefully, PR-1568 and PR-1460 will fix these issues. Closing this now; will re-open it if it happens again. PySpark OOMs without caching Key: SPARK-1343 URL: https://issues.apache.org/jira/browse/SPARK-1343 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0 Reporter: Matei Zaharia There have been several reports on the list of PySpark 0.9 OOMing even if it does simple maps and counts, whereas 0.9 didn't. This may be due to either the batching added to serialization, or due to invalid serialized data which makes the Java side allocate an overly large array. Needs investigating for 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
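For anyone hitting this before those PRs land, the reset interval is configurable per application; a minimal sketch (the value 100 is just an aggressive choice for illustration, not an official recommendation):
{code:python}
from pyspark import SparkConf, SparkContext

# Reset the JVM-side serialization stream more frequently so it does not
# keep references to every object written since the last reset.
conf = SparkConf().set("spark.serializer.objectStreamReset", "100")
sc = SparkContext(conf=conf)
{code}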
[jira] [Updated] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes
[ https://issues.apache.org/jira/browse/SPARK-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2718: - Description: Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) ... {code} was: Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} YARN does not handle spark configs with quotes or backslashes - Key: SPARK-2718 URL: https://issues.apache.org/jira/browse/SPARK-2718 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Andrew Or Fix For: 1.1.0 Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) ... {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
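The generic remedy for this class of bug is to shell-escape a value before splicing it into the container launch command, since the YARN launcher hands the command to a shell verbatim. A Python sketch of the idea (not Spark's actual fix; the main class is a placeholder):
{code:python}
import pipes  # shlex.quote on Python 3

value = 'spark shell with "quotes" and \\ backslashes \\'
# pipes.quote wraps the value so the shell treats it as one literal token.
launch_cmd = "java -Dspark.app.name=%s some.MainClass" % pipes.quote(value)
print(launch_cmd)
{code}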
[jira] [Resolved] (SPARK-1343) PySpark OOMs without caching
[ https://issues.apache.org/jira/browse/SPARK-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-1343. --- Resolution: Fixed Fix Version/s: 0.9.0 1.0.0 Target Version/s: 1.1.0 PySpark OOMs without caching Key: SPARK-1343 URL: https://issues.apache.org/jira/browse/SPARK-1343 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0 Reporter: Matei Zaharia Assignee: Davies Liu Fix For: 1.0.0, 0.9.0 There have been several reports on the list of PySpark 0.9 OOMing even if it does simple maps and counts, whereas 0.9 didn't. This may be due to either the batching added to serialization, or due to invalid serialized data which makes the Java side allocate an overly large array. Needs investigating for 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1343) PySpark OOMs without caching
[ https://issues.apache.org/jira/browse/SPARK-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077012#comment-14077012 ] Davies Liu commented on SPARK-1343: --- https://github.com/apache/spark/pull/1460 https://github.com/apache/spark/pull/1568 PySpark OOMs without caching Key: SPARK-1343 URL: https://issues.apache.org/jira/browse/SPARK-1343 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0 Reporter: Matei Zaharia Fix For: 0.9.0, 1.0.0 There have been several reports on the list of PySpark 0.9 OOMing even if it does simple maps and counts, whereas 0.9 didn't. This may be due to either the batching added to serialization, or due to invalid serialized data which makes the Java side allocate an overly large array. Needs investigating for 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1687) Support NamedTuples in RDDs
[ https://issues.apache.org/jira/browse/SPARK-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077019#comment-14077019 ] Kan Zhang commented on SPARK-1687: -- Sure, please go ahead and feel free to take over this JIRA. Support NamedTuples in RDDs --- Key: SPARK-1687 URL: https://issues.apache.org/jira/browse/SPARK-1687 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0 Environment: Spark version 1.0.0-SNAPSHOT Python 2.7.5 Reporter: Pat McDonough Assignee: Kan Zhang Add Support for NamedTuples in RDDs. Some sample code is below, followed by the current error that comes from it. Based on a quick conversation with [~ahirreddy], [Dill|https://github.com/uqfoundation/dill] might be a good solution here. {code} In [26]: from collections import namedtuple ... In [33]: Person = namedtuple('Person', 'id firstName lastName') In [34]: jon = Person(1, 'Jon', 'Doe') In [35]: jane = Person(2, 'Jane', 'Doe') In [36]: theDoes = sc.parallelize((jon, jane)) In [37]: theDoes.collect() Out[37]: [Person(id=1, firstName='Jon', lastName='Doe'), Person(id=2, firstName='Jane', lastName='Doe')] In [38]: theDoes.count() PySpark worker failed with exception: PySpark worker failed with exception: Traceback (most recent call last): File /Users/pat/Projects/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 283, in func def func(s, iterator): return f(iterator) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 129, in load_stream yield self._read_with_length(stream) File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 146, in _read_with_length return self.loads(obj) AttributeError: 'module' object has no attribute 'Person' Traceback (most recent call last): File /Users/pat/Projects/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 283, in func def func(s, iterator): return f(iterator) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 129, in load_stream yield self._read_with_length(stream) File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 146, in _read_with_length return self.loads(obj) AttributeError: 'module' object has no attribute 'Person' 14/04/30 14:43:53 ERROR Executor: Exception in task ID 23
org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /Users/pat/Projects/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 283, in func def func(s, iterator): return f(iterator) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 129, in load_stream yield self._read_with_length(stream) File
[jira] [Commented] (SPARK-2023) PySpark reduce does a map side reduce and then sends the results to the driver for final reduce, instead do this more like Scala Spark.
[ https://issues.apache.org/jira/browse/SPARK-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077075#comment-14077075 ] Davies Liu commented on SPARK-2023: --- In most cases, the result of reduce will be small, so collecting these small results from each partition and then reducing them will not be a bottleneck. PySpark reduce does a map side reduce and then sends the results to the driver for final reduce, instead do this more like Scala Spark. --- Key: SPARK-2023 URL: https://issues.apache.org/jira/browse/SPARK-2023 Project: Spark Issue Type: Improvement Components: PySpark Reporter: holdenk PySpark reduce does a map side reduce and then sends the results to the driver for final reduce, instead do this more like Scala Spark. The current implementation could be a bottleneck. -- This message was sent by Atlassian JIRA (v6.2#6252)
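For context, the scheme under discussion looks roughly like this (a simplified sketch of the two-phase reduce, not PySpark's exact code):
{code:python}
import functools

def rdd_reduce(rdd, f):
    # Phase 1, map side: fold each partition down to at most one value.
    def reduce_partition(iterator):
        acc, seen = None, False
        for x in iterator:
            acc = x if not seen else f(acc, x)
            seen = True
        return [acc] if seen else []
    # Phase 2, driver side: collect the small per-partition results
    # (at most one value per partition) and finish the fold locally.
    return functools.reduce(f, rdd.mapPartitions(reduce_partition).collect())
{code}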
[jira] [Created] (SPARK-2719) Add Mima binary checks to Flume-Sink
Tathagata Das created SPARK-2719: Summary: Add Mima binary checks to Flume-Sink Key: SPARK-2719 URL: https://issues.apache.org/jira/browse/SPARK-2719 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Tathagata Das Priority: Minor Mima binary check has been disabled for flume-sink in 1.1, as no previous version of flume-sink exists. This should be enabled for 1.2. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2022) Spark 1.0.0 is failing if mesos.coarse set to true
[ https://issues.apache.org/jira/browse/SPARK-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077109#comment-14077109 ] Timothy Chen commented on SPARK-2022: - Github PR: https://github.com/apache/spark/pull/1622 Spark 1.0.0 is failing if mesos.coarse set to true -- Key: SPARK-2022 URL: https://issues.apache.org/jira/browse/SPARK-2022 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Marek Wiewiorka Assignee: Tim Chen Priority: Critical more stderr --- WARNING: Logging before InitGoogleLogging() is written to STDERR I0603 16:07:53.721132 61192 exec.cpp:131] Version: 0.18.2 I0603 16:07:53.725230 61200 exec.cpp:205] Executor registered on slave 201405220917-134217738-5050-27119-0 Exception in thread main java.lang.NumberFormatException: For input string: sparkseq003.cloudapp.net at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:492) at java.lang.Integer.parseInt(Integer.java:527) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) more stdout --- Registered executor on sparkseq003.cloudapp.net Starting task 5 Forked command at 61202 sh -c '/home/mesos/spark-1.0.0/bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend -Dspark.mesos.coarse=true akka.tcp://sp...@sparkseq001.cloudapp.net:40312/user/CoarseG rainedScheduler 201405220917-134217738-5050-27119-0 sparkseq003.cloudapp.net 4' Command exited with status 1 (pid: 61202) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2022) Spark 1.0.0 is failing if mesos.coarse set to true
[ https://issues.apache.org/jira/browse/SPARK-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077110#comment-14077110 ] Apache Spark commented on SPARK-2022: - User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/1622 Spark 1.0.0 is failing if mesos.coarse set to true -- Key: SPARK-2022 URL: https://issues.apache.org/jira/browse/SPARK-2022 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Marek Wiewiorka Assignee: Tim Chen Priority: Critical more stderr --- WARNING: Logging before InitGoogleLogging() is written to STDERR I0603 16:07:53.721132 61192 exec.cpp:131] Version: 0.18.2 I0603 16:07:53.725230 61200 exec.cpp:205] Executor registered on slave 201405220917-134217738-5050-27119-0 Exception in thread main java.lang.NumberFormatException: For input string: sparkseq003.cloudapp.net at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:492) at java.lang.Integer.parseInt(Integer.java:527) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) more stdout --- Registered executor on sparkseq003.cloudapp.net Starting task 5 Forked command at 61202 sh -c '/home/mesos/spark-1.0.0/bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend -Dspark.mesos.coarse=true akka.tcp://sp...@sparkseq001.cloudapp.net:40312/user/CoarseG rainedScheduler 201405220917-134217738-5050-27119-0 sparkseq003.cloudapp.net 4' Command exited with status 1 (pid: 61202) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values
[ https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077115#comment-14077115 ] Robbie Russo commented on SPARK-1649: - Thrift also supports null values in a map and this makes any thrift generated parquet files that contain a map unreadable by spark sql due to the following code in parquet-thrift for generating the schema for maps: {code:title=parquet.thrift.ThriftSchemaConverter.java|borderStyle=solid} @Override public void visit(ThriftType.MapType mapType) { final ThriftField mapKeyField = mapType.getKey(); final ThriftField mapValueField = mapType.getValue(); //save env for map String mapName = currentName; Type.Repetition mapRepetition = currentRepetition; //=handle key currentFieldPath.push(mapKeyField); currentName = "key"; currentRepetition = REQUIRED; mapKeyField.getType().accept(this); Type keyType = currentType;//currentType is the already converted type currentFieldPath.pop(); //=handle value currentFieldPath.push(mapValueField); currentName = "value"; currentRepetition = OPTIONAL; mapValueField.getType().accept(this); Type valueType = currentType; currentFieldPath.pop(); if (keyType == null && valueType == null) { currentType = null; return; } if (keyType == null && valueType != null) throw new ThriftProjectionException("key of map is not specified in projection: " + currentFieldPath); //restore Env currentName = mapName; currentRepetition = mapRepetition; currentType = ConversionPatterns.mapType(currentRepetition, currentName, keyType, valueType); } {code} Which causes an error on the spark side when we reach this step in the toDataType function that asserts that both the key and value are of repetition level REQUIRED: {code:title=org.apache.spark.sql.parquet.ParquetTypes.scala|borderStyle=solid} case ParquetOriginalType.MAP => { assert( !groupType.getFields.apply(0).isPrimitive, "Parquet Map type malformatted: expected nested group for map!") val keyValueGroup = groupType.getFields.apply(0).asGroupType() assert( keyValueGroup.getFieldCount == 2, "Parquet Map type malformatted: nested group should have 2 (key, value) fields!") val keyType = toDataType(keyValueGroup.getFields.apply(0)) println("here") assert(keyValueGroup.getFields.apply(0).getRepetition == Repetition.REQUIRED) val valueType = toDataType(keyValueGroup.getFields.apply(1)) assert(keyValueGroup.getFields.apply(1).getRepetition == Repetition.REQUIRED) new MapType(keyType, valueType) } {code} Currently I have modified parquet-thrift to use repetition REQUIRED just to make spark sql able to work on the parquet files since we don't actually use null values in our maps. However it would be preferred to use parquet-thrift and spark sql out of the box and have them work nicely together with our existing thrift data types without having to modify dependencies. Figure out Nullability semantics for Array elements and Map values -- Key: SPARK-1649 URL: https://issues.apache.org/jira/browse/SPARK-1649 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Andre Schumacher Priority: Critical For the underlying storage layer it would simplify things such as schema conversions, predicate filter determination and such to record in the data type itself whether a column can be nullable. So the DataType type could look like this: abstract class DataType(nullable: Boolean = true) Concrete subclasses could then override the nullable val.
Mostly this could be left as the default but when types can be contained in nested types one could optimize for, e.g., arrays with elements that are nullable and those that are not. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values
[ https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077127#comment-14077127 ] Yin Huai commented on SPARK-1649: - [~rrusso2007] Can you open a JIRA for the issue of reading Parquet datasets? Figure out Nullability semantics for Array elements and Map values -- Key: SPARK-1649 URL: https://issues.apache.org/jira/browse/SPARK-1649 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Andre Schumacher Priority: Critical For the underlying storage layer it would simplify things such as schema conversions, predicate filter determination and such to record in the data type itself whether a column can be nullable. So the DataType type could look like this: abstract class DataType(nullable: Boolean = true) Concrete subclasses could then override the nullable val. Mostly this could be left as the default but when types can be contained in nested types one could optimize for, e.g., arrays with elements that are nullable and those that are not. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1687) Support NamedTuples in RDDs
[ https://issues.apache.org/jira/browse/SPARK-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077136#comment-14077136 ] Apache Spark commented on SPARK-1687: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/1623 Support NamedTuples in RDDs --- Key: SPARK-1687 URL: https://issues.apache.org/jira/browse/SPARK-1687 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0 Environment: Spark version 1.0.0-SNAPSHOT Python 2.7.5 Reporter: Pat McDonough Assignee: Davies Liu Add Support for NamedTuples in RDDs. Some sample code is below, followed by the current error that comes from it. Based on a quick conversation with [~ahirreddy], [Dill|https://github.com/uqfoundation/dill] might be a good solution here. {code} In [26]: from collections import namedtuple ... In [33]: Person = namedtuple('Person', 'id firstName lastName') In [34]: jon = Person(1, 'Jon', 'Doe') In [35]: jane = Person(2, 'Jane', 'Doe') In [36]: theDoes = sc.parallelize((jon, jane)) In [37]: theDoes.collect() Out[37]: [Person(id=1, firstName='Jon', lastName='Doe'), Person(id=2, firstName='Jane', lastName='Doe')] In [38]: theDoes.count() PySpark worker failed with exception: PySpark worker failed with exception: Traceback (most recent call last): File /Users/pat/Projects/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 283, in func def func(s, iterator): return f(iterator) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 129, in load_stream yield self._read_with_length(stream) File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 146, in _read_with_length return self.loads(obj) AttributeError: 'module' object has no attribute 'Person' Traceback (most recent call last): File /Users/pat/Projects/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 283, in func def func(s, iterator): return f(iterator) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 129, in load_stream yield self._read_with_length(stream) File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 146, in _read_with_length return self.loads(obj) AttributeError: 'module' object has no attribute 'Person' 14/04/30 14:43:53 ERROR
Executor: Exception in task ID 23 org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /Users/pat/Projects/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 283, in func def func(s, iterator): return f(iterator) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 129, in load_stream yield
[jira] [Created] (SPARK-2720) spark-examples should depend on HBase modules for HBase 0.96+
Ted Yu created SPARK-2720: - Summary: spark-examples should depend on HBase modules for HBase 0.96+ Key: SPARK-2720 URL: https://issues.apache.org/jira/browse/SPARK-2720 Project: Spark Issue Type: Task Reporter: Ted Yu Priority: Minor With this change: {code} diff --git a/pom.xml b/pom.xml index 93ef3b9..092430a 100644 --- a/pom.xml +++ b/pom.xml @@ -122,7 +122,7 @@ <hadoop.version>1.0.4</hadoop.version> <protobuf.version>2.4.1</protobuf.version> <yarn.version>${hadoop.version}</yarn.version> -<hbase.version>0.94.6</hbase.version> +<hbase.version>0.98.4</hbase.version> <zookeeper.version>3.4.5</zookeeper.version> <hive.version>0.12.0</hive.version> <parquet.version>1.4.3</parquet.version> {code} I got: {code} [ERROR] Failed to execute goal on project spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:1.1.0-SNAPSHOT: Could not find artifact org.apache.hbase:hbase:jar:0.98.4 in maven-repo (http://repo.maven.apache.org/maven2) - [Help 1] {code} To build against HBase 0.96+, spark-examples needs to specify HBase modules (hbase-client, etc) in dependencies - possibly using a new profile. -- This message was sent by Atlassian JIRA (v6.2#6252)
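A sketch of what the per-module dependencies might look like under such a profile, since HBase 0.96+ split the monolithic hbase artifact into modules (the exact artifact list the examples module needs would require verification):
{code:xml}
<!-- examples/pom.xml, under a hypothetical profile for HBase 0.96+ -->
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-common</artifactId>
  <version>${hbase.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>${hbase.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-server</artifactId>
  <version>${hbase.version}</version>
</dependency>
{code}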
[jira] [Commented] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values
[ https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077150#comment-14077150 ] Robbie Russo commented on SPARK-1649: - Just opened https://issues.apache.org/jira/browse/SPARK-2721 Figure out Nullability semantics for Array elements and Map values -- Key: SPARK-1649 URL: https://issues.apache.org/jira/browse/SPARK-1649 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Andre Schumacher Priority: Critical For the underlying storage layer it would simplify things such as schema conversions, predicate filter determination and such to record in the data type itself whether a column can be nullable. So the DataType type could look like this: abstract class DataType(nullable: Boolean = true) Concrete subclasses could then override the nullable val. Mostly this could be left as the default but when types can be contained in nested types one could optimize for, e.g., arrays with elements that are nullable and those that are not. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2721) Fix MapType compatibility issues with reading Parquet datasets
Robbie Russo created SPARK-2721: --- Summary: Fix MapType compatibility issues with reading Parquet datasets Key: SPARK-2721 URL: https://issues.apache.org/jira/browse/SPARK-2721 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1 Reporter: Robbie Russo Parquet-thrift (along with most likely other implementations of parquet) supports null values in a map and this makes any thrift generated parquet files that contain a map unreadable by spark sql due to the following code in parquet-thrift for generating the schema for maps: {code:title=parquet.thrift.ThriftSchemaConverter.java|borderStyle=solid} @Override public void visit(ThriftType.MapType mapType) { final ThriftField mapKeyField = mapType.getKey(); final ThriftField mapValueField = mapType.getValue(); //save env for map String mapName = currentName; Type.Repetition mapRepetition = currentRepetition; //=handle key currentFieldPath.push(mapKeyField); currentName = "key"; currentRepetition = REQUIRED; mapKeyField.getType().accept(this); Type keyType = currentType;//currentType is the already converted type currentFieldPath.pop(); //=handle value currentFieldPath.push(mapValueField); currentName = "value"; currentRepetition = OPTIONAL; mapValueField.getType().accept(this); Type valueType = currentType; currentFieldPath.pop(); if (keyType == null && valueType == null) { currentType = null; return; } if (keyType == null && valueType != null) throw new ThriftProjectionException("key of map is not specified in projection: " + currentFieldPath); //restore Env currentName = mapName; currentRepetition = mapRepetition; currentType = ConversionPatterns.mapType(currentRepetition, currentName, keyType, valueType); } {code} Which causes an error on the spark side when we reach this step in the toDataType function that asserts that both the key and value are of repetition level REQUIRED: {code:title=org.apache.spark.sql.parquet.ParquetTypes.scala|borderStyle=solid} case ParquetOriginalType.MAP => { assert( !groupType.getFields.apply(0).isPrimitive, "Parquet Map type malformatted: expected nested group for map!") val keyValueGroup = groupType.getFields.apply(0).asGroupType() assert( keyValueGroup.getFieldCount == 2, "Parquet Map type malformatted: nested group should have 2 (key, value) fields!") val keyType = toDataType(keyValueGroup.getFields.apply(0)) println("here") assert(keyValueGroup.getFields.apply(0).getRepetition == Repetition.REQUIRED) val valueType = toDataType(keyValueGroup.getFields.apply(1)) assert(keyValueGroup.getFields.apply(1).getRepetition == Repetition.REQUIRED) new MapType(keyType, valueType) } {code} Currently I have modified parquet-thrift to use repetition REQUIRED just to make spark sql able to work on the parquet files since we don't actually use null values in our maps. However it would be preferred to use parquet-thrift and spark sql out of the box and have them work nicely together with our existing thrift data types without having to modify dependencies. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2550) Support regularization and intercept in pyspark's linear methods
[ https://issues.apache.org/jira/browse/SPARK-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077155#comment-14077155 ] Michael Yannakopoulos commented on SPARK-2550: -- Please ignore the previous pull request since it did not include the commits that should appear related to the aforementioned issue. The new correct pull request is the following: [https://github.com/apache/spark/pull/1624] Support regularization and intercept in pyspark's linear methods Key: SPARK-2550 URL: https://issues.apache.org/jira/browse/SPARK-2550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Michael Yannakopoulos Python API doesn't provide options to set regularization parameter and intercept in linear methods, which should be fixed in v1.1. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2550) Support regularization and intercept in pyspark's linear methods
[ https://issues.apache.org/jira/browse/SPARK-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077158#comment-14077158 ] Apache Spark commented on SPARK-2550: - User 'miccagiann' has created a pull request for this issue: https://github.com/apache/spark/pull/1624 Support regularization and intercept in pyspark's linear methods Key: SPARK-2550 URL: https://issues.apache.org/jira/browse/SPARK-2550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Michael Yannakopoulos Python API doesn't provide options to set regularization parameter and intercept in linear methods, which should be fixed in v1.1. -- This message was sent by Atlassian JIRA (v6.2#6252)
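Once merged, calls like the following should work from Python; the parameter names are assumed from the pull request discussion and may change before release:
{code:python}
from pyspark.mllib.classification import LogisticRegressionWithSGD

# 'points' is an RDD of LabeledPoint. regType/regParam/intercept are the
# new knobs being added (names assumed from the PR, subject to change).
model = LogisticRegressionWithSGD.train(
    points, iterations=100, step=1.0,
    regParam=0.01, regType="l2", intercept=True)
{code}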
[jira] [Commented] (SPARK-2382) build error:
[ https://issues.apache.org/jira/browse/SPARK-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077161#comment-14077161 ] Mukul Jain commented on SPARK-2382: --- How do I open a PR? I am planning to close this issue with the comment: building some of the projects, such as Project External MQTT, requires explicit access over HTTPS, so make sure your build machine is configured properly to download dependencies over HTTPS (check HTTPS proxy configuration and such). build error: - Key: SPARK-2382 URL: https://issues.apache.org/jira/browse/SPARK-2382 Project: Spark Issue Type: Question Components: Build Affects Versions: 1.0.0 Environment: Ubuntu 12.04 precise. spark@ubuntu-cdh5-spark:~/spark-1.0.0$ mvn -version Apache Maven 3.0.4 Maven home: /usr/share/maven Java version: 1.6.0_31, vendor: Sun Microsystems Inc. Java home: /usr/lib/jvm/j2sdk1.6-oracle/jre Default locale: en_US, platform encoding: UTF-8 OS name: linux, version: 3.11.0-15-generic, arch: amd64, family: unix Reporter: Mukul Jain Labels: newbie Unable to build: Maven can't download a dependency. I checked my http_proxy and https_proxy settings and they are working fine; other HTTP and HTTPS dependencies were downloaded fine. The build process always gets stuck at this repo. Manually downloading also fails with an exception. [INFO] [INFO] Building Spark Project External MQTT 1.0.0 [INFO] Downloading: https://repository.apache.org/content/repositories/releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry INFO: I/O exception (java.net.ConnectException) caught when processing request: Connection timed out Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry INFO: Retrying request -- This message was sent by Atlassian JIRA (v6.2#6252)
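Since Maven does not read the http_proxy/https_proxy environment variables, the usual fix is to configure the proxy in Maven's own settings (host and port below are placeholders):
{code:xml}
<!-- ~/.m2/settings.xml -->
<settings>
  <proxies>
    <proxy>
      <id>https-proxy</id>
      <active>true</active>
      <protocol>https</protocol>
      <host>proxy.example.com</host>
      <port>8080</port>
    </proxy>
  </proxies>
</settings>
{code}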
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077179#comment-14077179 ] Ted Malaska commented on SPARK-2447: Making good progress. Just FYI, it may take a little longer because the version of HBase in Spark is 0.94.1, which has a couple of different APIs. Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Tathagata Das Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
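The connection-per-partition, batched-put pattern being designed looks roughly like this, sketched in Python with a hypothetical HBase client (the actual work targets the JVM APIs and HBase 0.94.1):
{code:python}
def upsert_partition(rows):
    # One connection per partition; auto-flush off so puts are buffered
    # for throughput, with a single flush when the partition is done.
    table = hbase_connect("my_table", auto_flush=False)  # hypothetical client
    try:
        for row_key, mutations in rows:
            table.put(row_key, mutations)
        table.flush()
    finally:
        table.close()

rdd.foreachPartition(upsert_partition)
{code}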
[jira] [Commented] (SPARK-2580) broken pipe collecting schemardd results
[ https://issues.apache.org/jira/browse/SPARK-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077193#comment-14077193 ] Apache Spark commented on SPARK-2580: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/1625 broken pipe collecting schemardd results Key: SPARK-2580 URL: https://issues.apache.org/jira/browse/SPARK-2580 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.0.0 Environment: fedora 21 local and rhel 7 clustered (standalone) Reporter: Matthew Farrellee Assignee: Davies Liu Labels: py4j, pyspark {code} from pyspark.sql import SQLContext sqlCtx = SQLContext(sc) # size of cluster impacts where this breaks (i.e. 2**15 vs 2**2) data = sc.parallelize([{'name': 'index', 'value': 0}] * 2**20) sdata = sqlCtx.inferSchema(data) sdata.first() {code} result: note - result returned as well as error {code} sdata.first() 14/07/18 12:10:25 INFO SparkContext: Starting job: runJob at PythonRDD.scala:290 14/07/18 12:10:25 INFO DAGScheduler: Got job 43 (runJob at PythonRDD.scala:290) with 1 output partitions (allowLocal=true) 14/07/18 12:10:25 INFO DAGScheduler: Final stage: Stage 52(runJob at PythonRDD.scala:290) 14/07/18 12:10:25 INFO DAGScheduler: Parents of final stage: List() 14/07/18 12:10:25 INFO DAGScheduler: Missing parents: List() 14/07/18 12:10:25 INFO DAGScheduler: Computing the requested partition locally 14/07/18 12:10:25 INFO PythonRDD: Times: total = 45, boot = 3, init = 40, finish = 2 14/07/18 12:10:25 INFO SparkContext: Job finished: runJob at PythonRDD.scala:290, took 0.048348426 s {u'name': u'index', u'value': 0} PySpark worker failed with exception: Traceback (most recent call last): File /home/matt/Documents/Repositories/spark/dist/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py, line 191, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py, line 124, in dump_stream self._write_with_length(obj, stream) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py, line 139, in _write_with_length stream.write(serialized) IOError: [Errno 32] Broken pipe Traceback (most recent call last): File /home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py, line 130, in launch_worker worker(listen_sock) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py, line 119, in worker outfile.flush() IOError: [Errno 32] Broken pipe {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2305) pyspark - depend on py4j 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077202#comment-14077202 ] Apache Spark commented on SPARK-2305: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/1626 pyspark - depend on py4j 0.8.1 Key: SPARK-2305 URL: https://issues.apache.org/jira/browse/SPARK-2305 Project: Spark Issue Type: Dependency upgrade Components: PySpark Affects Versions: 1.0.0 Reporter: Matthew Farrellee Assignee: Josh Rosen Priority: Minor py4j 0.8.1 has a bug in java_import that results in extraneous warnings pyspark should depend on a py4j version 0.8.1 (non exists at time of filing) that includes https://github.com/bartdag/py4j/commit/64cd657e75dbe769c5e3bf757fcf83b5c0f8f4f0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2722) Mechanism for escaping spark configs is not consistent
Andrew Or created SPARK-2722: Summary: Mechanism for escaping spark configs is not consistent Key: SPARK-2722 URL: https://issues.apache.org/jira/browse/SPARK-2722 Project: Spark Issue Type: Bug Affects Versions: 1.0.1 Reporter: Andrew Or Priority: Minor Fix For: 1.1.0 Currently, you can specify a spark config in spark-defaults.conf as follows: {code} spark.magic "Mr. Johnson" {code} and this will preserve the double quotes as part of the string. Naturally, if you want to do the equivalent in spark.*.extraJavaOptions, you would use the following {code} spark.executor.extraJavaOptions -Dmagic=\"Mr. Johnson\" {code} However, this fails because the backslashes go away and it tries to interpret Johnson as the main class argument. Instead, you have to do the following {code} spark.executor.extraJavaOptions -Dmagic=\\\"Mr. Johnson\\\" {code} which is not super intuitive. Note that this only applies to standalone mode. In YARN it's not even possible to use quoted strings in config values (SPARK-2718). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2722) Mechanism for escaping spark configs is not consistent
[ https://issues.apache.org/jira/browse/SPARK-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2722: - Description: Currently, you can specify a spark config in spark-defaults.conf as follows: {code} spark.magic "Mr. Johnson" {code} and this will preserve the double quotes as part of the string. Naturally, if you want to do the equivalent in spark.*.extraJavaOptions, you would use the following: {code} spark.executor.extraJavaOptions -Dmagic=\"Mr. Johnson\" {code} However, this fails because the backslashes go away and it tries to interpret Johnson as the main class argument. Instead, you have to do the following: {code} spark.executor.extraJavaOptions -Dmagic=\\\"Mr. Johnson\\\" {code} which is not super intuitive. Note that this only applies to standalone mode. In YARN it's not even possible to use quoted strings in config values (SPARK-2718). was: Currently, you can specify a spark config in spark-defaults.conf as follows: {code} spark.magic "Mr. Johnson" {code} and this will preserve the double quotes as part of the string. Naturally, if you want to do the equivalent in spark.*.extraJavaOptions, you would use the following {code} spark.executor.extraJavaOptions -Dmagic=\"Mr. Johnson\" {code} However, this fails because the backslashes go away and it tries to interpret Johnson as the main class argument. Instead, you have to do the following {code} spark.executor.extraJavaOptions -Dmagic=\\\"Mr. Johnson\\\" {code} which is not super intuitive. Note that this only applies to standalone mode. In YARN it's not even possible to use quoted strings in config values (SPARK-2718). Mechanism for escaping spark configs is not consistent -- Key: SPARK-2722 URL: https://issues.apache.org/jira/browse/SPARK-2722 Project: Spark Issue Type: Bug Affects Versions: 1.0.1 Reporter: Andrew Or Priority: Minor Fix For: 1.1.0 Currently, you can specify a spark config in spark-defaults.conf as follows: {code} spark.magic "Mr. Johnson" {code} and this will preserve the double quotes as part of the string. Naturally, if you want to do the equivalent in spark.*.extraJavaOptions, you would use the following: {code} spark.executor.extraJavaOptions -Dmagic=\"Mr. Johnson\" {code} However, this fails because the backslashes go away and it tries to interpret Johnson as the main class argument. Instead, you have to do the following: {code} spark.executor.extraJavaOptions -Dmagic=\\\"Mr. Johnson\\\" {code} which is not super intuitive. Note that this only applies to standalone mode. In YARN it's not even possible to use quoted strings in config values (SPARK-2718). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-791) [pyspark] operator.getattr not serialized
[ https://issues.apache.org/jira/browse/SPARK-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077239#comment-14077239 ] Davies Liu commented on SPARK-791: -- This will be fixed by PR-1627[1] [1] https://github.com/apache/spark/pull/1627 [pyspark] operator.getattr not serialized - Key: SPARK-791 URL: https://issues.apache.org/jira/browse/SPARK-791 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.7.2, 0.9.0 Reporter: Jim Blomo Priority: Minor Using operator.itemgetter as a function in map seems to confuse the serialization process in pyspark. I'm using itemgetter to return tuples, which fails with a TypeError (details below). Using an equivalent lambda function returns the correct result. Use a test file: {code:sh} echo 1,1 > test.txt {code} Then try mapping it to a tuple: {code:python} import csv sc.textFile("test.txt").mapPartitions(csv.reader).map(lambda l: (l[0],l[1])).first() Out[7]: ('1', '1') {code} But this does not work when using operator.itemgetter: {code:python} import operator sc.textFile("test.txt").mapPartitions(csv.reader).map(operator.itemgetter(0,1)).first() # TypeError: list indices must be integers, not tuple {code} This is running with git master, commit 6d60fe571a405eb9306a2be1817901316a46f892 IPython 0.13.2 java version "1.7.0_25" Scala code runner version 2.9.1 Ubuntu 12.04 Full debug output: {code:python} In [9]: sc.textFile("test.txt").mapPartitions(csv.reader).map(operator.itemgetter(0,1)).first() 13/07/04 16:19:49 INFO storage.MemoryStore: ensureFreeSpace(33632) called with curMem=201792, maxMem=339585269 13/07/04 16:19:49 INFO storage.MemoryStore: Block broadcast_6 stored as values to memory (estimated size 32.8 KB, free 323.6 MB) 13/07/04 16:19:49 INFO mapred.FileInputFormat: Total input paths to process : 1 13/07/04 16:19:49 INFO spark.SparkContext: Starting job: takePartition at NativeMethodAccessorImpl.java:-2 13/07/04 16:19:49 INFO scheduler.DAGScheduler: Got job 4 (takePartition at NativeMethodAccessorImpl.java:-2) with 1 output partitions (allowLocal=true) 13/07/04 16:19:49 INFO scheduler.DAGScheduler: Final stage: Stage 4 (PythonRDD at NativeConstructorAccessorImpl.java:-2) 13/07/04 16:19:49 INFO scheduler.DAGScheduler: Parents of final stage: List() 13/07/04 16:19:49 INFO scheduler.DAGScheduler: Missing parents: List() 13/07/04 16:19:49 INFO scheduler.DAGScheduler: Computing the requested partition locally 13/07/04 16:19:49 INFO scheduler.DAGScheduler: Failed to run takePartition at NativeMethodAccessorImpl.java:-2 --- Py4JJavaError Traceback (most recent call last) ipython-input-9-1fdb3e7a8ac7 in module() 1 sc.textFile("test.txt").mapPartitions(csv.reader).map(operator.itemgetter(0,1)).first() /home/jim/src/spark/python/pyspark/rdd.pyc in first(self) 389 2 390 -- 391 return self.take(1)[0] 392 393 def saveAsTextFile(self, path): /home/jim/src/spark/python/pyspark/rdd.pyc in take(self, num) 372 items = [] 373 for partition in range(self._jrdd.splits().size()): -- 374 iterator = self.ctx._takePartition(self._jrdd.rdd(), partition) 375 # Each item in the iterator is a string, Python object, batch of 376 # Python objects.
Regardless, it is sufficient to take `num` /home/jim/src/spark/python/lib/py4j0.7.egg/py4j/java_gateway.pyc in __call__(self, *args) 498 answer = self.gateway_client.send_command(command) 499 return_value = get_return_value(answer, self.gateway_client, -- 500 self.target_id, self.name) 501 502 for temp_arg in temp_args: /home/jim/src/spark/python/lib/py4j0.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name) 298 raise Py4JJavaError( 299 'An error occurred while calling {0}{1}{2}.\n'. -- 300 format(target_id, '.', name), value) 301 else: 302 raise Py4JError( Py4JJavaError: An error occurred while calling z:spark.api.python.PythonRDD.takePartition. : spark.api.python.PythonException: Traceback (most recent call last): File /home/jim/src/spark/python/pyspark/worker.py, line 53, in main for obj in func(split_index, iterator): File /home/jim/src/spark/python/pyspark/serializers.py, line 24, in batched for item in iterator: TypeError: list indices must be integers, not tuple at spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:117) at
[jira] [Commented] (SPARK-1138) Spark 0.9.0 does not work with Hadoop / HDFS
[ https://issues.apache.org/jira/browse/SPARK-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077270#comment-14077270 ] Russell Jurney commented on SPARK-1138: --- I built spark master with 'sbt/sbt assembly publish-local' and had issues with my hadoop version, which is CDH 4.4. Then I built against CDH 4.4 via 'SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.4.0 sbt/sbt assembly publish-local'. Note: I did not clean, and I saw this issue. This is with Spark trunk at a time when the released version is 1.0.1. Then I cleaned and rebuilt, and the issue persists. What should I do? Spark 0.9.0 does not work with Hadoop / HDFS Key: SPARK-1138 URL: https://issues.apache.org/jira/browse/SPARK-1138 Project: Spark Issue Type: Bug Reporter: Sam Abeyratne UPDATE: This problem is certainly related to trying to use Spark 0.9.0 and the latest Cloudera Hadoop / HDFS in the same jar. It seems no matter how I fiddle with the deps, they do not play nice together. I'm getting a java.util.concurrent.TimeoutException when trying to create a spark context with 0.9. I cannot, whatever I do, change the timeout. I've tried using System.setProperty, the SparkConf mechanism of creating a SparkContext, and the -D flags when executing my jar. I seem to be able to run simple jobs from the spark-shell OK, but my more complicated jobs require external libraries, so I need to build jars and execute them. Some code that causes this: println("Creating config") val conf = new SparkConf() .setMaster(clusterMaster) .setAppName("MyApp") .setSparkHome(sparkHome) .set("spark.akka.askTimeout", parsed.getOrElse("timeouts", "100")) .set("spark.akka.timeout", parsed.getOrElse("timeouts", "100")) println("Creating sc") implicit val sc = new SparkContext(conf) The output: Creating config Creating sc log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. [ERROR] [02/26/2014 11:05:25.491] [main] [Remoting] Remoting error: [Startup timed out] [ akka.remote.RemoteTransportException: Startup timed out at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129) at akka.remote.Remoting.start(Remoting.scala:191) at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579) at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577) at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588) at akka.actor.ActorSystem$.apply(ActorSystem.scala:111) at akka.actor.ActorSystem$.apply(ActorSystem.scala:104) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:126) at org.apache.spark.SparkContext.<init>(SparkContext.scala:139) at com.adbrain.accuracy.EvaluateAdtruthIDs$.main(EvaluateAdtruthIDs.scala:40) at com.adbrain.accuracy.EvaluateAdtruthIDs.main(EvaluateAdtruthIDs.scala) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at akka.remote.Remoting.start(Remoting.scala:173) ... 
11 more ] Exception in thread "main" java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at akka.remote.Remoting.start(Remoting.scala:173) at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579) at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577) at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588) at akka.actor.ActorSystem$.apply(ActorSystem.scala:111) at akka.actor.ActorSystem$.apply(ActorSystem.scala:104) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96)
[jira] [Commented] (SPARK-1138) Spark 0.9.0 does not work with Hadoop / HDFS
[ https://issues.apache.org/jira/browse/SPARK-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077271#comment-14077271 ] Russell Jurney commented on SPARK-1138: --- See https://github.com/apache/spark/pull/455 Spark 0.9.0 does not work with Hadoop / HDFS Key: SPARK-1138 URL: https://issues.apache.org/jira/browse/SPARK-1138 Project: Spark Issue Type: Bug Reporter: Sam Abeyratne -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2723) Block Manager should catch exceptions in putValues
Shivaram Venkataraman created SPARK-2723: Summary: Block Manager should catch exceptions in putValues Key: SPARK-2723 URL: https://issues.apache.org/jira/browse/SPARK-2723 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Shivaram Venkataraman The BlockManager should catch exceptions encountered while writing out files to disk. Right now these exceptions get counted as user-level task failures and the job is aborted after failing 4 times. We should either fail the executor or handle this better to prevent the job from dying. I ran into an issue where one disk on a large EC2 cluster failed and this resulted in a long-running job terminating. Longer term, we should also look at black-listing local directories when one of them becomes unusable. Exception pasted below: 14/07/29 00:55:39 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: /mnt2/spark/spark-local-20140728175256-e7cb/28/broadcast_264_piece20 (Input/output error) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:221) at java.io.FileOutputStream.<init>(FileOutputStream.java:171) at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:79) at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:66) at org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:847) at org.apache.spark.storage.MemoryStore$$anonfun$ensureFreeSpace$4.apply(MemoryStore.scala:267) at org.apache.spark.storage.MemoryStore$$anonfun$ensureFreeSpace$4.apply(MemoryStore.scala:256) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.storage.MemoryStore.ensureFreeSpace(MemoryStore.scala:256) at org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:179) at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:76) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:663) at org.apache.spark.storage.BlockManager.put(BlockManager.scala:574) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108) -- This message was sent by Atlassian JIRA (v6.2#6252)
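As a rough sketch of the direction proposed here (the names below are illustrative, not the actual BlockManager code), the disk write could be wrapped so that an I/O failure surfaces as a storage-layer problem rather than being counted as a user-level task failure:
{code:scala}
import java.io.{FileOutputStream, IOException}

// Illustrative only: wrap the write so an I/O error on a bad disk is
// reported to the storage layer instead of failing the running task.
def writeBlock(path: String, bytes: Array[Byte]): Boolean = {
  var out: FileOutputStream = null
  try {
    out = new FileOutputStream(path)
    out.write(bytes)
    true
  } catch {
    case e: IOException =>
      // A fuller fix might also black-list the local directory that
      // contains `path` so later blocks avoid the failed disk.
      System.err.println(s"Disk write failed for $path: ${e.getMessage}")
      false
  } finally {
    if (out != null) out.close()
  }
}
{code}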
[jira] [Reopened] (SPARK-2512) Stratified sampling
[ https://issues.apache.org/jira/browse/SPARK-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin reopened SPARK-2512: -- Stratified sampling --- Key: SPARK-2512 URL: https://issues.apache.org/jira/browse/SPARK-2512 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Doris Xin PR: https://github.com/apache/spark/pull/1025 -- This message was sent by Atlassian JIRA (v6.2#6252)
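For context, the stratified sampling work surfaces on pair RDDs as sampleByKey, where each key is a stratum with its own sampling rate. A minimal usage sketch, assuming a running spark-shell with an existing SparkContext sc (API names as they shipped in Spark 1.1):
{code:scala}
// Each key ("stratum") is sampled at the rate given in `fractions`.
val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4)))
val fractions = Map("a" -> 0.5, "b" -> 1.0)

// withReplacement = false, per-key fractions, fixed seed for repeatability.
val sampled = data.sampleByKey(false, fractions, 42L)
sampled.collect().foreach(println)
{code}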
[jira] [Created] (SPARK-2724) Python version of Random RDD without support for arbitrary distribution
Doris Xin created SPARK-2724: Summary: Python version of Random RDD without support for arbitrary distribution Key: SPARK-2724 URL: https://issues.apache.org/jira/browse/SPARK-2724 Project: Spark Issue Type: Sub-task Reporter: Doris Xin A Python version of [SPARK-2514], but without support for randomRDD and randomVectorRDD, which accept arbitrary DistributionGenerator objects. -- This message was sent by Atlassian JIRA (v6.2#6252)
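For reference, this ports the Scala random RDD generators to Python while leaving out the variants that take a user-supplied generator. The Scala side looks roughly like this (names as they shipped in org.apache.spark.mllib.random; assumes a spark-shell with an existing SparkContext sc):
{code:scala}
import org.apache.spark.mllib.random.RandomRDDs

// One million draws from N(0, 1), spread over 10 partitions.
val normals = RandomRDDs.normalRDD(sc, 1000000L, 10, 1L)

// Sanity check: the mean should be near 0 and the stdev near 1.
println((normals.mean(), normals.stdev()))
{code}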
[jira] [Commented] (SPARK-2724) Python version of Random RDD without support for arbitrary distribution
[ https://issues.apache.org/jira/browse/SPARK-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077286#comment-14077286 ] Apache Spark commented on SPARK-2724: - User 'dorx' has created a pull request for this issue: https://github.com/apache/spark/pull/1628 Python version of Random RDD without support for arbitrary distribution --- Key: SPARK-2724 URL: https://issues.apache.org/jira/browse/SPARK-2724 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Doris Xin Assignee: Doris Xin A Python version of [SPARK-2514], but without support for randomRDD and randomVectorRDD, which accept arbitrary DistributionGenerator objects. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2134) Report metrics before application finishes
[ https://issues.apache.org/jira/browse/SPARK-2134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2134: - Assignee: Rahul Singhal Report metrics before application finishes -- Key: SPARK-2134 URL: https://issues.apache.org/jira/browse/SPARK-2134 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Rahul Singhal Assignee: Rahul Singhal Priority: Minor Metric values could have been updated after they were last reported. These last-updated values may be useful, but they will never be reported if the application itself finishes first. A simple solution is to update/report all the sinks before stopping the MetricsSystem. The problem is that the metrics system may depend on some other component that has already been stopped. -- This message was sent by Atlassian JIRA (v6.2#6252)
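A minimal sketch of the proposed fix, with illustrative names rather than Spark's exact metrics API: flush every sink one last time before tearing the metrics system down, so the last-updated values still get out.
{code:scala}
// Illustrative sink abstraction with an explicit flush step.
trait Sink {
  def report(): Unit // push the current metric values to the backend
  def stop(): Unit
}

// Report once more before stopping, so values updated after the last
// scheduled report are not silently dropped when the app finishes.
def stopMetricsSystem(sinks: Seq[Sink]): Unit = {
  sinks.foreach(_.report())
  sinks.foreach(_.stop())
}
{code}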
[jira] [Created] (SPARK-2726) Remove SortOrder in ShuffleDependency and HashShuffleReader
Reynold Xin created SPARK-2726: -- Summary: Remove SortOrder in ShuffleDependency and HashShuffleReader Key: SPARK-2726 URL: https://issues.apache.org/jira/browse/SPARK-2726 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin SPARK-2125 introduced a SortOrder in ShuffleDependency and HashShuffleReader. However, the key ordering already includes the SortOrder information, since an Ordering can be reversed easily. This is similar to Java's Comparator interface: rarely does an API accept both a Comparator and a SortOrder. We should remove the SortOrder. -- This message was sent by Atlassian JIRA (v6.2#6252)
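The reversibility point in concrete terms: a Scala Ordering already encodes sort direction, because reversing it is a single method call, so carrying a separate SortOrder alongside it is redundant:
{code:scala}
// An Ordering carries direction by itself; no extra SortOrder needed.
val asc: Ordering[Int] = Ordering.Int
val desc: Ordering[Int] = Ordering.Int.reverse

println(Seq(3, 1, 2).sorted(asc))  // List(1, 2, 3)
println(Seq(3, 1, 2).sorted(desc)) // List(3, 2, 1)
{code}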
[jira] [Created] (SPARK-2727) HashShuffleReader should do in-place sort
Reynold Xin created SPARK-2727: -- Summary: HashShuffleReader should do in-place sort Key: SPARK-2727 URL: https://issues.apache.org/jira/browse/SPARK-2727 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.0.3 Reporter: Reynold Xin Assignee: Reynold Xin HashShuffleReader uses sortWith to sort an array, which creates a copy of the array. We can use an in-place sort algorithm to reduce the memory overhead. -- This message was sent by Atlassian JIRA (v6.2#6252)
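To make the allocation difference concrete: sortWith on an Array builds and returns a sorted copy, while scala.util.Sorting.quickSort reorders the existing array in place:
{code:scala}
val arr = Array(3, 1, 2)

// sortWith allocates and returns a new sorted array; `arr` is untouched.
val copied = arr.sortWith(_ < _)

// quickSort sorts `arr` in place, with no second array allocated.
scala.util.Sorting.quickSort(arr)

println(copied.mkString(",")) // 1,2,3
println(arr.mkString(","))    // 1,2,3
{code}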