[jira] [Created] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
Kousuke Saruta created SPARK-3399: - Summary: Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR Key: SPARK-3399 URL: https://issues.apache.org/jira/browse/SPARK-3399 Project: Spark Issue Type: Bug Components: PySpark Reporter: Kousuke Saruta Some PySpark tests create temporary files under /tmp on the local file system, but if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in spark-env.sh, the tests expect the temporary files to be on the FileSystem configured in core-site.xml, even though the actual files are on the local file system. I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR; if some tests need those variables, they should set them explicitly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
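The proposed behavior can be sketched as a small test fixture that temporarily drops the two variables so temporary files resolve against the local file system. The helper name below is hypothetical (our illustration, not the actual patch in the linked PR):

```python
import os
from contextlib import contextmanager


@contextmanager
def without_hadoop_conf(varnames=("HADOOP_CONF_DIR", "YARN_CONF_DIR")):
    """Temporarily hide Hadoop/YARN config so tests use the local file system."""
    saved = {v: os.environ.pop(v, None) for v in varnames}
    try:
        yield
    finally:
        # Restore whatever was set before, so later tests are unaffected.
        for var, value in saved.items():
            if value is not None:
                os.environ[var] = value
```

Tests that genuinely need HADOOP_CONF_DIR or YARN_CONF_DIR can then set them inside the test body, as the report suggests.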
[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121016#comment-14121016 ] Apache Spark commented on SPARK-3399: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2270
[jira] [Created] (SPARK-3400) GraphX unit tests fail nondeterministically
Ankur Dave created SPARK-3400: - Summary: GraphX unit tests fail nondeterministically Key: SPARK-3400 URL: https://issues.apache.org/jira/browse/SPARK-3400 Project: Spark Issue Type: Bug Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave Priority: Blocker GraphX unit tests have been failing since the fix to SPARK-2823 was merged: https://github.com/apache/spark/commit/9b225ac3072de522b40b46aba6df1f1c231f13ef. Failures have appeared as Snappy parsing errors and shuffle FileNotFoundExceptions. A local test showed that these failures occurred in about 3/10 test runs. Reverting the mentioned commit seems to solve the problem. Since this is blocking everyone else, I'm submitting a hotfix to do that, and we can diagnose the problem in more detail afterwards.
[jira] [Created] (SPARK-3401) Wrong usage of tee command in python/run-tests
Kousuke Saruta created SPARK-3401: - Summary: Wrong usage of tee command in python/run-tests Key: SPARK-3401 URL: https://issues.apache.org/jira/browse/SPARK-3401 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Kousuke Saruta In python/run-tests, the tee command is used with the -a option to append to unit-tests.log for logging, but the usage is wrong: the output of the tee command is itself redirected into unit-tests.log. tee's output does not need to be redirected, and the redirection causes unit-tests.log to be truncated incorrectly.
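For context on why the redirect is wrong: `cmd | tee -a unit-tests.log` already writes each line to the log and passes it through to stdout. A small Python model of tee's append behavior (illustrative only, not the run-tests script; the function name is ours):

```python
def tee_append(lines, log_path):
    """Model of `tee -a log_path`: append every line to the log AND pass it
    through unchanged, the way tee forwards data to its stdout."""
    passed_through = []
    with open(log_path, "a") as log:  # "-a" means open for append, never truncate
        for line in lines:
            log.write(line + "\n")
            passed_through.append(line)
    return passed_through
```

Because tee already writes the log itself, its pass-through output should go to the console, never back into the same file: if tee's stdout were redirected into unit-tests.log with `>`, the shell would truncate that file before tee even runs, which matches the invalid truncation described in the report.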
[jira] [Commented] (SPARK-3401) Wrong usage of tee command in python/run-tests
[ https://issues.apache.org/jira/browse/SPARK-3401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121033#comment-14121033 ] Apache Spark commented on SPARK-3401: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2272
[jira] [Commented] (SPARK-3400) GraphX unit tests fail nondeterministically
[ https://issues.apache.org/jira/browse/SPARK-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121032#comment-14121032 ] Apache Spark commented on SPARK-3400: - User 'ankurdave' has created a pull request for this issue: https://github.com/apache/spark/pull/2271
[jira] [Resolved] (SPARK-3400) GraphX unit tests fail nondeterministically
[ https://issues.apache.org/jira/browse/SPARK-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-3400. --- Resolution: Fixed Fix Version/s: 1.1.0, 1.3.0, 1.0.1 Issue resolved by pull request 2271 [https://github.com/apache/spark/pull/2271]
[jira] [Reopened] (SPARK-2823) GraphX jobs throw IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave reopened SPARK-2823: --- I had to revert this because of SPARK-3400. GraphX jobs throw IllegalArgumentException -- Key: SPARK-2823 URL: https://issues.apache.org/jira/browse/SPARK-2823 Project: Spark Issue Type: Bug Components: GraphX Reporter: Lu Lu Fix For: 1.1.1, 1.2.0, 1.0.3 If the user sets "spark.default.parallelism" and the value differs from the EdgeRDD partition number, GraphX jobs will throw IllegalArgumentException:

14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to exception - job: 1
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
    at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:54)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:197)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:272)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
    at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:279)
    at org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:219)
    at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:672)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1184)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
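The precondition that fails here is small enough to model without Spark: ZippedPartitionsBaseRDD.getPartitions requires both RDDs to have the same number of partitions, so an EdgeRDD repartitioned according to spark.default.parallelism can no longer be zipped with its companion RDD. A plain-Python sketch of that check (hypothetical function, for illustration only):

```python
def zip_partitions(parts_a, parts_b):
    """Mimic ZippedPartitionsBaseRDD's precondition: the two sides must have
    the same number of partitions, or zipping is impossible."""
    if len(parts_a) != len(parts_b):
        raise ValueError("Can't zip RDDs with unequal numbers of partitions")
    # Zip element-wise within each pair of corresponding partitions.
    return [list(zip(a, b)) for a, b in zip(parts_a, parts_b)]
```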
[jira] [Commented] (SPARK-3353) Stage id monotonicity (parent stage should have lower stage id)
[ https://issues.apache.org/jira/browse/SPARK-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121075#comment-14121075 ] Apache Spark commented on SPARK-3353: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2273 Stage id monotonicity (parent stage should have lower stage id) --- Key: SPARK-3353 URL: https://issues.apache.org/jira/browse/SPARK-3353 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Reynold Xin The way stage IDs are generated, parent stages actually get higher stage ids. This is very confusing because parent stages are scheduled and executed first. We should reverse that order so the scheduling timeline of stages (absent failures) is monotonic, i.e. stages that are executed first have lower stage ids.
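One way to obtain the monotonicity property described above is to number stages with a depth-first walk that assigns ids to all parents before the child. This is only a sketch of the idea (hypothetical names, not the DAGScheduler implementation):

```python
def assign_stage_ids(parents):
    """parents: dict mapping stage name -> list of parent stage names.
    Returns ids such that every parent id < its child's id."""
    ids = {}
    counter = 0

    def visit(stage):
        nonlocal counter
        if stage in ids:
            return
        for p in parents.get(stage, []):
            visit(p)  # number parents first, so they receive lower ids
        ids[stage] = counter
        counter += 1

    for stage in parents:
        visit(stage)
    return ids
```

With this numbering, a failure-free scheduling timeline executes stages in increasing id order.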
[jira] [Commented] (SPARK-2321) Design a proper progress reporting event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121085#comment-14121085 ] Chengxiang Li commented on SPARK-2321: -- I've collected some Hive-side requirements here, which should be helpful for the Spark job status and statistics API design. Hive should be able to get the following job status information through the Spark job status API: 1. job identifier 2. current job execution state, which should include RUNNING/SUCCEEDED/FAILED/KILLED 3. running/failed/killed/total task numbers at the job level 4. stage identifier 5. stage state, which should include RUNNING/SUCCEEDED/FAILED/KILLED 6. running/failed/killed/total task numbers at the stage level. MR/Tez use Counters to collect statistics; similar to MR/Tez Counters, it would be better if the Spark job statistics API organized statistics with: 1. the same kind of statistic grouped by groupName 2. a displayName for both groups and statistics, which gives a uniform display string for frontends (Web UI/Hive CLI/...). Design a proper progress reporting event listener API --- Key: SPARK-2321 URL: https://issues.apache.org/jira/browse/SPARK-2321 Project: Spark Issue Type: Improvement Components: Java API, Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical This is a ticket to track progress on redesigning the SparkListener and JobProgressListener API. There are multiple problems with the current design, including: 0. I'm not sure if the API is usable in Java (there are at least some enums we used in Scala and a bunch of case classes that might complicate things). 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of attention to it yet. Something as important as progress reporting deserves a more stable API. 2. There is no easy way to connect jobs with stages. Similarly, there is no easy way to connect job groups with jobs / stages. 3. JobProgressListener itself has no encapsulation at all. States can be arbitrarily mutated by external programs. Variable names are sort of randomly decided and inconsistent. We should just revisit these and propose a new, concrete design.
[jira] [Commented] (SPARK-2321) Design a proper progress reporting event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121094#comment-14121094 ] Reynold Xin commented on SPARK-2321: What about pull vs. push? I.e., should this be a listener-like API, or a service with state that the caller can poll?
[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121115#comment-14121115 ] Apache Spark commented on SPARK-2978: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/2274 Provide an MR-style shuffle transformation -- Key: SPARK-2978 URL: https://issues.apache.org/jira/browse/SPARK-2978 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Sandy Ryza For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple of ways it could make sense to expose this: * Add a new operator: groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe? * Allow groupByKey to take an ordering param for keys within a partition
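The requested semantics can be modeled in a few lines of plain Python: within each partition, keys come out in sorted order, each paired with all of its values. This is an illustrative sketch with hypothetical names (`mr_style_shuffle` is not a proposed Spark API):

```python
from itertools import groupby


def mr_style_shuffle(records, num_partitions, partitioner=hash):
    """Group (key, value) pairs by key and sort keys within each partition,
    mirroring the Hadoop MR shuffle semantics described above."""
    partitions = [[] for _ in range(num_partitions)]
    for k, v in records:
        partitions[partitioner(k) % num_partitions].append((k, v))
    result = []
    for part in partitions:
        part.sort(key=lambda kv: kv[0])  # keys in sorted order within the partition
        result.append([
            (k, [v for _, v in group])   # each key with all of its values
            for k, group in groupby(part, key=lambda kv: kv[0])
        ])
    return result
```

In real Spark this corresponds to a shuffle with a key ordering; the sketch only shows the contract a legacy-MR caller would rely on.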
[jira] [Updated] (SPARK-3377) Don't mix metrics from different applications
[ https://issues.apache.org/jira/browse/SPARK-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3377: -- Summary: Don't mix metrics from different applications (was: codahale base Metrics data between applications can jumble up together) Don't mix metrics from different applications - Key: SPARK-3377 URL: https://issues.apache.org/jira/browse/SPARK-3377 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Critical I'm using the codahale-based MetricsSystem of Spark with JMX or Graphite, and I saw the following two problems. (1) When applications that have the same spark.app.name run on a cluster at the same time, some metric names get mixed together; e.g., the SparkPi.DAGScheduler.stage.failedStages metrics of the two applications collide. (2) When two or more executors run on the same machine, the JVM metrics of the executors get mixed; e.g., the current implementation cannot distinguish which executor a metric such as jvm.memory belongs to.
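A common remedy for both problems is to prefix every metric name with per-application and per-executor identifiers before registering it. A minimal sketch of the idea (class and field names are ours, not Spark's MetricsSystem):

```python
class NamespacedMetrics:
    """Tiny model of a metrics registry that prefixes every metric name with
    the application id and executor id, so two applications with the same
    spark.app.name (or two executors on one host) no longer collide."""

    def __init__(self, app_id, executor_id):
        self.prefix = f"{app_id}.{executor_id}"
        self.values = {}

    def gauge(self, name, value):
        # Register the metric under its fully qualified, collision-free name.
        qualified = f"{self.prefix}.{name}"
        self.values[qualified] = value
        return qualified
```

With this scheme, a JMX or Graphite consumer sees distinct series per application and per executor instead of one jumbled series.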
[jira] [Resolved] (SPARK-640) Update Hadoop 1 version to 1.1.0 (especially on AMIs)
[ https://issues.apache.org/jira/browse/SPARK-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-640. - Resolution: Fixed This looks stale, right? The Hadoop 1 version has been at 1.2.1 for some time. Update Hadoop 1 version to 1.1.0 (especially on AMIs) - Key: SPARK-640 URL: https://issues.apache.org/jira/browse/SPARK-640 Project: Spark Issue Type: New Feature Reporter: Matei Zaharia Hadoop 1.1.0 has a fix for the notorious trailing-slash issue for directory objects in S3 (https://issues.apache.org/jira/browse/HADOOP-5836), so it would be good to support it on the AMIs.
[jira] [Resolved] (SPARK-529) Have a single file that controls the environmental variables and spark config options
[ https://issues.apache.org/jira/browse/SPARK-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-529. - Resolution: Won't Fix This looks obsolete and/or fixed, as variables like SPARK_MEM are deprecated, and there is spark-env.sh now. Have a single file that controls the environmental variables and spark config options - Key: SPARK-529 URL: https://issues.apache.org/jira/browse/SPARK-529 Project: Spark Issue Type: Improvement Reporter: Reynold Xin E.g., multiple places in the code base use SPARK_MEM and have their own default set to 512. We need a central place to enforce default values as well as to document the variables.
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121160#comment-14121160 ] Saisai Shao commented on SPARK-3129: Hi [~hshreedharan], one more question: is your design goal to fix the data loss caused by receiver node failure? It seems data can potentially be lost when it is only stored in the BlockGenerator and not yet in the BlockManager when the node fails. Your design doc mainly focuses on driver failure, so what are your thoughts? Prevent data loss in Spark Streaming Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Reporter: Hari Shreedharan Assignee: Hari Shreedharan Attachments: StreamingPreventDataLoss.pdf Spark Streaming can lose small amounts of data when the driver goes down and the sending system cannot re-send the data (or the data has already expired on the sender side). The document attached has more details.
[jira] [Commented] (SPARK-2321) Design a proper progress reporting event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121173#comment-14121173 ] Chengxiang Li commented on SPARK-2321: -- I'm not sure whether I understand you correctly; here is my thought on the API design: # The JobStatus/JobStatistic API only contains getter methods. # JobProgressListener holds the JobStatusImpl/JobStatisticImpl variables. # The DAGScheduler posts events to JobProgressListener through the listener bus. # Callers get JobStatusImpl/JobStatisticImpl with updated state from JobProgressListener. So I think it should be a pull-style API.
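The four-step design above can be sketched as a pull-style API: the scheduler pushes events into a listener, which keeps caller-facing snapshots exposing only getters. A minimal model (our names, not the eventual SPARK-2321 API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobStatus:
    """Getter-only snapshot handed to callers (point 1): immutable to them."""
    job_id: int
    state: str        # RUNNING / SUCCEEDED / FAILED / KILLED
    tasks_done: int
    tasks_total: int


class JobProgressListener:
    """Holds the status objects internally (point 2); the scheduler pushes
    events to it (point 3); callers poll for the latest snapshot (point 4)."""

    def __init__(self):
        self._jobs = {}

    def on_event(self, job_id, state, tasks_done, tasks_total):
        # Each event replaces the snapshot, so callers never see partial updates.
        self._jobs[job_id] = JobStatus(job_id, state, tasks_done, tasks_total)

    def job_status(self, job_id):
        return self._jobs.get(job_id)
```

The frozen dataclass captures the encapsulation concern from the ticket: external programs can read the state but cannot mutate it.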
[jira] [Created] (SPARK-3402) Library for Natural Language Processing over Spark.
Nagamallikarjuna created SPARK-3402: --- Summary: Library for Natural Language Processing over Spark. Key: SPARK-3402 URL: https://issues.apache.org/jira/browse/SPARK-3402 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Nagamallikarjuna Priority: Minor
[jira] [Commented] (SPARK-3402) Library for Natural Language Processing over Spark.
[ https://issues.apache.org/jira/browse/SPARK-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121190#comment-14121190 ] Nagamallikarjuna commented on SPARK-3402: - We have gone through Spark and its ecosystem and didn't find any natural language processing library for Spark. We (Impetus) are working to implement some natural language processing features over Spark. We have already developed a working library of algorithms using the OpenNLP toolkit, and will extend it to other NLP toolkits like Stanford, CTakes, NLTK, etc. We are planning to contribute our work to the existing MLlib or as a new subproject. Thanks, Naga
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121276#comment-14121276 ] David commented on SPARK-1473: -- Hi all, I am Dr. David Martinez and this is my first comment on this project. We implemented all the feature selection methods included in Brown, G., Pocock, A., Zhao, M. J., Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13, 27-66, added more optimizations, and left the framework open to include more criteria. We opened a pull request in the past but did not finish it. You can have a look at our GitHub: https://github.com/LIDIAgroup/SparkFeatureSelection We would like to finish our pull request. Feature selection for high dimensional datasets --- Key: SPARK-1473 URL: https://issues.apache.org/jira/browse/SPARK-1473 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ignacio Zendejas Assignee: Alexander Ulanov Priority: Minor Labels: features For classification tasks involving large feature spaces in the order of tens of thousands or higher (e.g., text classification with n-grams, where n > 1), it is often useful to rank and filter out irrelevant features, reducing the feature space by at least one or two orders of magnitude without impacting performance on key evaluation metrics (accuracy/precision/recall). A flexible feature evaluation interface needs to be designed, and at least two methods should be implemented, with Information Gain being a priority as it has been shown to be amongst the most reliable. Special consideration should be taken in the design to account for wrapper methods (see research papers below), which are more practical for lower dimensional data. Relevant research: * Brown, G., Pocock, A., Zhao, M. J., Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13, 27-66. * Forman, George. An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research 3 (2003): 1289-1305.
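Information Gain, named above as the priority ranking method, is the mutual information between a discrete feature and the class label, IG(Y; X) = H(Y) - H(Y|X). A small reference computation (illustrative only, not the SparkFeatureSelection code):

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - H(Y|X) for one discrete feature column."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional
```

Ranking features by this score and keeping the top k is the filter-style reduction the ticket describes; wrapper methods instead re-train the model per candidate subset.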
[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121324#comment-14121324 ] Helena Edelson commented on SPARK-2892: --- I see the same with 1.0.2 streaming: ERROR 08:26:21,139 Deregistered receiver for stream 0: Stopped by driver WARN 08:26:21,211 Stopped executor without error WARN 08:26:21,213 All of the receivers have not deregistered, Map(0 - ReceiverInfo(0,ActorReceiver-0,null,false,host,Stopped by driver,)) Socket Receiver does not stop when streaming context is stopped --- Key: SPARK-2892 URL: https://issues.apache.org/jira/browse/SPARK-2892 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.2 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Running NetworkWordCount with {quote} ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); Thread.sleep(6) {quote} gives the following error {quote} 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 10047 ms on localhost (1/1) 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at ReceiverTracker.scala:275) finished in 10.056 s 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at ReceiverTracker.scala:275, took 10.179263 s 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been terminated 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not deregistered, Map(0 - ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after time 1407375433000 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler 14/08/06 18:37:13 INFO 
StreamingContext: StreamingContext stopped successfully 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: {quote}
[jira] [Comment Edited] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121324#comment-14121324 ] Helena Edelson edited comment on SPARK-2892 at 9/4/14 1:12 PM: --- I see the same with 1.0.2 streaming, with or without stopGracefully = true ssc.stop(stopSparkContext = false, stopGracefully = true) ERROR 08:26:21,139 Deregistered receiver for stream 0: Stopped by driver WARN 08:26:21,211 Stopped executor without error WARN 08:26:21,213 All of the receivers have not deregistered, Map(0 - ReceiverInfo(0,ActorReceiver-0,null,false,host,Stopped by driver,)) was (Author: helena_e): I see the same with 1.0.2 streaming: ERROR 08:26:21,139 Deregistered receiver for stream 0: Stopped by driver WARN 08:26:21,211 Stopped executor without error WARN 08:26:21,213 All of the receivers have not deregistered, Map(0 - ReceiverInfo(0,ActorReceiver-0,null,false,host,Stopped by driver,))
[jira] [Assigned] (SPARK-3375) spark on yarn container allocation issues
[ https://issues.apache.org/jira/browse/SPARK-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-3375: Assignee: Thomas Graves spark on yarn container allocation issues - Key: SPARK-3375 URL: https://issues.apache.org/jira/browse/SPARK-3375 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves Assignee: Thomas Graves Priority: Blocker It looks like if YARN doesn't grant the containers immediately, Spark stops asking for them, and the YARN application hangs without ever getting any executors. This was introduced by https://github.com/apache/spark/pull/2169
[jira] [Commented] (SPARK-3375) spark on yarn container allocation issues
[ https://issues.apache.org/jira/browse/SPARK-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121451#comment-14121451 ] Apache Spark commented on SPARK-3375: - User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/2275
[jira] [Created] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
Alexander Ulanov created SPARK-3403: --- Summary: NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) Key: SPARK-3403 URL: https://issues.apache.org/jira/browse/SPARK-3403 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Environment: Setup: Windows 7, x64 libraries for netlib-java (as described on https://github.com/fommil/netlib-java). I used OpenBLAS x64 and MinGW64 precompiled DLLs. Reporter: Alexander Ulanov Fix For: 1.1.0 Code: val model = NaiveBayes.train(train) val predictionAndLabels = test.map { point => val score = model.predict(point.features) (score, point.label) } predictionAndLabels.foreach(println) Result: the program crashes with Process finished with exit code -1073741819 (0xC0000005) after displaying the first prediction
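As a side note on reading that exit code: -1073741819 is just the signed 32-bit view of a Windows NTSTATUS value, so the failure can be classified without any Spark involvement. A self-contained Scala sketch (the object name is illustrative):

```scala
// The JVM reports a native crash's exit status as a signed 32-bit int.
// Reinterpreting -1073741819 as unsigned yields the Windows NTSTATUS code
// 0xC0000005 (STATUS_ACCESS_VIOLATION), i.e. an access violation inside
// the native BLAS call rather than a JVM-level exception.
object ExitCodeCheck {
  def toNtStatus(exitCode: Int): String =
    f"0x${exitCode.toLong & 0xFFFFFFFFL}%08X"

  def main(args: Array[String]): Unit =
    println(toNtStatus(-1073741819)) // prints 0xC0000005
}
```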
[jira] [Updated] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-3403: Attachment: NativeNN.scala The attached file contains an example that reproduces the same issue.
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121516#comment-14121516 ] Xiangrui Meng commented on SPARK-3403: -- Did you test the setup of netlib-java with OpenBLAS? I hit a JNI issue (a year ago, maybe fixed since) with netlib-java and multithreaded OpenBLAS. Could you try compiling OpenBLAS with `USE_THREAD=0`? If it still doesn't work, please attach the driver/executor logs. Thanks!
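For reference, building a single-threaded OpenBLAS as suggested looks roughly like this (`USE_THREAD=0` is the standard OpenBLAS Makefile flag; the install prefix is illustrative):

```shell
# Build OpenBLAS with its internal thread pool disabled, so every BLAS
# call runs on the caller's thread (avoids JNI/threading interactions).
make USE_THREAD=0
# Install to a prefix of your choice (path is illustrative).
make PREFIX=/opt/openblas-singlethread install
```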
[jira] [Created] (SPARK-3404) SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1
Sean Owen created SPARK-3404: Summary: SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1 Key: SPARK-3404 URL: https://issues.apache.org/jira/browse/SPARK-3404 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Maven-based Jenkins builds have been failing for over a month. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ It's SparkSubmitSuite that fails. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull {code} SparkSubmitSuite ... - launch simple application with spark-submit *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... 
- spark submit includes jars passed in through --jar *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, local-cluster[2,1,512], --jars, file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... {code} SBT builds don't fail, so it is likely to be due to some difference in how the tests are run rather than a problem with the test or core project. This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the cause identified in that JIRA is, at least, not the only cause. (Although, it wouldn't hurt to be doubly sure this is not an issue by changing the Jenkins config to invoke {{mvn ... clean package}} rather than {{mvn clean}} followed by {{mvn ... package}}.) This JIRA tracks investigation into a different cause. Right now I have some further information but not a PR yet. Part of the issue is that there is no clue in the log about why {{spark-submit}} exited with status 1. 
See https://github.com/apache/spark/pull/2108/files and https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at least print stdout to the log too. The SparkSubmit program exits with 1 when the main class it is supposed to run is not found (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) This is for example SimpleApplicationTest (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) The test actually submits an empty JAR not containing this class. It relies on {{spark-submit}} finding the class within the compiled test-classes of the Spark project. However it does seem to be compiled and present even with Maven. If modified to print stdout and stderr, and dump the actual command, I see an empty stdout, and only the command to stderr: {code} Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home/bin/java -cp
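To make such failures diagnosable, a test harness can capture both output streams of the child process instead of discarding them. A self-contained sketch using `scala.sys.process` (the shell command below is a stand-in for illustration, not the real spark-submit invocation):

```scala
import scala.sys.process._

object CaptureOutput {
  // Run a command and return (exit code, stdout, stderr), so that a
  // failing child process can be diagnosed from the log instead of
  // only seeing "exited with code 1".
  def run(cmd: Seq[String]): (Int, String, String) = {
    val out = new StringBuilder
    val err = new StringBuilder
    val logger = ProcessLogger(
      line => out.append(line).append('\n'),
      line => err.append(line).append('\n'))
    val status = cmd ! logger // blocks until the process exits
    (status, out.toString, err.toString)
  }

  def main(args: Array[String]): Unit = {
    val (status, out, err) =
      run(Seq("sh", "-c", "echo hello; echo oops 1>&2; exit 1"))
    println(s"status=$status stdout=${out.trim} stderr=${err.trim}")
  }
}
```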
[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121561#comment-14121561 ] Helena Edelson commented on SPARK-2892: --- I wonder if the ERROR should be a WARN or INFO since it occurs as a result of ReceiverSupervisorImpl receiving a StopReceiver, and Deregistered receiver for stream seems like the expected behavior.
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121563#comment-14121563 ] Alexander Ulanov commented on SPARK-3403: - Yes, I tried using netlib-java separately with the same OpenBLAS setup and it worked properly, even across several threads. However, I didn't mimic the same multi-threading setup as MLlib has because it is complicated. Do you want me to send you all the DLLs that I used? I had trouble compiling OpenBLAS for Windows, so I used precompiled x64 versions from the OpenBLAS and MinGW64 websites.
[jira] [Comment Edited] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121561#comment-14121561 ] Helena Edelson edited comment on SPARK-2892 at 9/4/14 5:01 PM: --- I wonder if the ERROR should be a WARN or INFO since it occurs as a result of ReceiverSupervisorImpl receiving a StopReceiver, and Deregistered receiver for stream seems like the expected behavior. DEBUG 13:00:22,418 Stopping JobScheduler INFO 13:00:22,441 Received stop signal INFO 13:00:22,441 Sent stop signal to all 1 receivers INFO 13:00:22,442 Stopping receiver with message: Stopped by driver: INFO 13:00:22,442 Called receiver onStop INFO 13:00:22,443 Deregistering receiver 0 ERROR 13:00:22,445 Deregistered receiver for stream 0: Stopped by driver INFO 13:00:22,445 Stopped receiver 0 was (Author: helena_e): I wonder if the ERROR should be a WARN or INFO since it occurs as a result of ReceiverSupervisorImpl receiving a StopReceiver, and Deregistered receiver for stream seems like the expected behavior. 
[jira] [Commented] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https
[ https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121608#comment-14121608 ] Apache Spark commented on SPARK-3286: - User 'benoyantony' has created a pull request for this issue: https://github.com/apache/spark/pull/2276 Cannot view ApplicationMaster UI when Yarn’s url scheme is https Key: SPARK-3286 URL: https://issues.apache.org/jira/browse/SPARK-3286 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Benoy Antony Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch The Spark ApplicationMaster starts its web UI at http://host-name:port. When the ApplicationMaster registers its URL with the Resource Manager, the URL does not contain a URI scheme. If the scheme is absent, the Resource Manager’s web app proxy uses the HTTP policy of the Resource Manager (YARN-1553). If that HTTP policy is https, the web app proxy tries to access https://host-name:port, which results in an error.
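The fix amounts to registering a URL that carries an explicit scheme. A minimal, self-contained sketch of that normalization (the `withScheme` helper is hypothetical, not the code in the PR):

```scala
object AmUrl {
  // If the URL already starts with a scheme (e.g. "http://", "https://"),
  // leave it alone; otherwise prepend the given default. Without an
  // explicit scheme, YARN's web app proxy falls back to its own HTTP
  // policy (YARN-1553), which breaks when that policy is https.
  def withScheme(url: String, defaultScheme: String = "http"): String =
    if (url.matches("^[A-Za-z][A-Za-z0-9+.-]*://.*")) url
    else s"$defaultScheme://$url"
}
```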
[jira] [Commented] (SPARK-3404) SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121646#comment-14121646 ] Andrew Or commented on SPARK-3404: -- Thanks for looking into this Sean. Does this happen all the time or only once in a while? We have observed the same tests failing on our Jenkins, which runs the tests through sbt, so the behavior is consistent with running them through Maven. If we run 'sbt test-only SparkSubmitSuite' it always passes, but if we run 'sbt test' it sometimes fails. This has also been failing for a while in sbt. Very roughly, I remember we began seeing it after https://github.com/apache/spark/pull/1777 went in, though I have gone down that path to debug possible port collisions to no avail. A related test failure is in DriverSuite, which also calls `Utils.executeAndGetOutput`. Have you seen that failing in Maven? I will keep investigating in parallel for sbt, though I suspect the root cause is the same. Let me know if you find anything.
[jira] [Updated] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3404: - Priority: Critical (was: Major)
[jira] [Updated] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3404: - Summary: SparkSubmitSuite fails with spark-submit exits with code 1 (was: SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1) SparkSubmitSuite fails with spark-submit exits with code 1 Key: SPARK-3404 URL: https://issues.apache.org/jira/browse/SPARK-3404 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2, 1.1.0 Reporter: Sean Owen Maven-based Jenkins builds have been failing for over a month. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ It's SparkSubmitSuite that fails. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull {code} SparkSubmitSuite ... - launch simple application with spark-submit *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... 
- spark submit includes jars passed in through --jar *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, local-cluster[2,1,512], --jars, file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... {code} SBT builds don't fail, so it is likely to be due to some difference in how the tests are run rather than a problem with test or core project. This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the cause identified in that JIRA is, at least, not the only cause. (Although, it wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins config to invoke {{mvn clean mvn ... package}} {{mvn ... clean package}}.) This JIRA tracks investigation into a different cause. Right now I have some further information but not a PR yet. Part of the issue is that there is no clue in the log about why {{spark-submit}} exited with status 1. 
See https://github.com/apache/spark/pull/2108/files and https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at least print stdout to the log too. The SparkSubmit program exits with 1 when the main class it is supposed to run is not found (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322). One example is SimpleApplicationTest (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339). The test actually submits an empty JAR that does not contain this class; it relies on {{spark-submit}} finding the class within the compiled test-classes of the Spark project. However, the class does seem to be compiled and present even with Maven. If modified to print stdout and stderr, and dump the actual command, I see an empty stdout, and only the command to stderr.
[jira] [Commented] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121650#comment-14121650 ] Andrew Or commented on SPARK-3404: -- I have updated the title to reflect this.
[jira] [Updated] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3404: - Target Version/s: 1.1.1
[jira] [Updated] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3404: - Affects Version/s: 1.1.0
[jira] [Commented] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121653#comment-14121653 ] Sean Owen commented on SPARK-3404: -- It's 100% repeatable in Maven for me locally, which seems to be Jenkins' experience too. I don't see the same problem with SBT (/dev/run-tests) locally, although I can't say I run that regularly. I could rewrite the SparkSubmitSuite to submit a JAR file that actually contains the class it's trying to invoke. Maybe that's smarter? The problem here seems to be the vagaries of what the run-time classpath is during an SBT vs. Maven test. Would anyone second that? Separately, it would probably not hurt to get in that change that logs stdout / stderr from the Utils method.
[jira] [Resolved] (SPARK-1078) Replace lift-json with json4s-jackson
[ https://issues.apache.org/jira/browse/SPARK-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-1078. --- Resolution: Fixed Fix Version/s: 1.0.0 It looks like this was fixed in SPARK-1132 / Spark 1.0.0, where we migrated to json4s-jackson. Replace lift-json with json4s-jackson - Key: SPARK-1078 URL: https://issues.apache.org/jira/browse/SPARK-1078 Project: Spark Issue Type: Task Components: Deploy, Web UI Affects Versions: 0.9.0 Reporter: William Benton Priority: Minor Fix For: 1.0.0 json4s-jackson is a Jackson-backed implementation of the Json4s common JSON API for Scala JSON libraries. (Evan Chan has a nice comparison of Scala JSON libraries here: http://engineering.ooyala.com/blog/comparing-scala-json-libraries) It is Apache-licensed, mostly API-compatible with lift-json, and easier for downstream operating system distributions to consume than lift-json. In terms of performance, json4s-jackson is slightly slower than but comparable to lift-json on my machine when parsing very small JSON files (<2kb and ~30 objects), around 40% faster than lift-json on medium-sized files (~50kb), and significantly (~10x) faster on multi-megabyte files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https
[ https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3286: -- Component/s: Web UI Cannot view ApplicationMaster UI when Yarn’s url scheme is https Key: SPARK-3286 URL: https://issues.apache.org/jira/browse/SPARK-3286 Project: Spark Issue Type: Bug Components: Web UI, YARN Affects Versions: 1.0.2 Reporter: Benoy Antony Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch The Spark ApplicationMaster starts its web UI at http://host-name:port. When the Spark ApplicationMaster registers its URL with the Resource Manager, the URL does not contain a URI scheme. If the URL scheme is absent, the Resource Manager's web app proxy will use the HTTP policy of the Resource Manager (YARN-1553). If the HTTP policy of the Resource Manager is https, then the web app proxy will try to access https://host-name:port. This will result in an error.
[jira] [Updated] (SPARK-3284) saveAsParquetFile not working on windows
[ https://issues.apache.org/jira/browse/SPARK-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pravesh Jain updated SPARK-3284: Description:
{code}
object parquet {
  case class Person(name: String, age: Int)
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
    import sqlContext.createSchemaRDD
    val people = sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
    people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
    val parquetFile = sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
  }
}
{code}
gives the error Exception in thread "main" java.lang.NullPointerException at org.apache.spark.parquet$.main(parquet.scala:16), which is the saveAsParquetFile line. This works fine on Linux, but running it from Eclipse on Windows gives the error.
saveAsParquetFile not working on windows Key: SPARK-3284 URL: https://issues.apache.org/jira/browse/SPARK-3284 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Environment: Windows Reporter: Pravesh Jain Priority: Minor
{code}
object parquet {
  case class Person(name: String, age: Int)
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
    import sqlContext.createSchemaRDD
    val people = sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
    people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
    val parquetFile = sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
  }
}
{code}
gives the error Exception in thread "main" java.lang.NullPointerException at org.apache.spark.parquet$.main(parquet.scala:16), which is the saveAsParquetFile line. This works fine on Linux, but running it from Eclipse on Windows gives the error.
[jira] [Updated] (SPARK-2015) Spark UI issues at scale
[ https://issues.apache.org/jira/browse/SPARK-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2015: -- Component/s: Web UI Spark UI issues at scale Key: SPARK-2015 URL: https://issues.apache.org/jira/browse/SPARK-2015 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.0.0 Reporter: Reynold Xin This is an umbrella ticket for issues related to Spark's web ui when we run Spark at scale (large datasets, large number of machines, or large number of tasks).
[jira] [Assigned] (SPARK-3061) Maven build fails in Windows OS
[ https://issues.apache.org/jira/browse/SPARK-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-3061: - Assignee: Andrew Or (was: Josh Rosen) Re-assigning to Andrew, who's going to backport it. Maven build fails in Windows OS --- Key: SPARK-3061 URL: https://issues.apache.org/jira/browse/SPARK-3061 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2, 1.1.0 Environment: Windows Reporter: Masayoshi TSUZUKI Assignee: Andrew Or Priority: Minor Fix For: 1.2.0 Maven build fails in Windows OS with this error message.
{noformat}
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec (default) on project spark-core_2.10: Command execution failed. Cannot run program "unzip" (in directory "C:\path\to\gitofspark\python"): CreateProcess error=2, The system cannot find the file specified -> [Help 1]
{noformat}
[jira] [Updated] (SPARK-2334) Attribute Error calling PipelinedRDD.id() in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2334: -- Affects Version/s: 1.1.0 Attribute Error calling PipelinedRDD.id() in pyspark Key: SPARK-2334 URL: https://issues.apache.org/jira/browse/SPARK-2334 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.1.0 Reporter: Diana Carroll calling the id() function of a PipelinedRDD causes an error in PySpark. (Works fine in Scala.) The second id() call here fails, the first works:
{code}
r1 = sc.parallelize([1,2,3])
r1.id()
r2 = r1.map(lambda i: i+1)
r2.id()
{code}
Error:
{code}
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-a0cf66fcf645> in <module>()
----> 1 r2.id()

/usr/lib/spark/python/pyspark/rdd.py in id(self)
    180         """A unique ID for this RDD (within its SparkContext)."""
    181
--> 182         return self._id
    183
    184     def __repr__(self):

AttributeError: 'PipelinedRDD' object has no attribute '_id'
{code}
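The traceback above is the classic pattern of a subclass whose initializer never sets an attribute that an inherited method reads. A minimal stand-alone reproduction of the mechanism (class names are illustrative; this is not PySpark's actual implementation):

```python
class RDD:
    def __init__(self):
        self._id = 1  # normally assigned when the RDD is registered

    def id(self):
        """A unique ID for this RDD (within its SparkContext)."""
        return self._id

class PipelinedRDD(RDD):
    def __init__(self, prev):
        # Bug pattern: the parent initializer is never invoked,
        # so self._id is never assigned and id() raises AttributeError.
        self.prev = prev
```

Calling `PipelinedRDD(RDD(1)).id()` then fails exactly like the report: `AttributeError: 'PipelinedRDD' object has no attribute '_id'`; the fix in such cases is to assign the attribute (or call the parent initializer) in the subclass's `__init__`.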
[jira] [Commented] (SPARK-640) Update Hadoop 1 version to 1.1.0 (especially on AMIs)
[ https://issues.apache.org/jira/browse/SPARK-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121883#comment-14121883 ] Matei Zaharia commented on SPARK-640: - [~pwendell] what is our Hadoop 1 version on AMIs now? Update Hadoop 1 version to 1.1.0 (especially on AMIs) - Key: SPARK-640 URL: https://issues.apache.org/jira/browse/SPARK-640 Project: Spark Issue Type: New Feature Reporter: Matei Zaharia Hadoop 1.1.0 has a fix for the notorious trailing-slash issue for directory objects in S3 (https://issues.apache.org/jira/browse/HADOOP-5836), so it would be good to support it on the AMIs.
[jira] [Created] (SPARK-3405) EC2 cluster creation on VPC
Dawson Reid created SPARK-3405: -- Summary: EC2 cluster creation on VPC Key: SPARK-3405 URL: https://issues.apache.org/jira/browse/SPARK-3405 Project: Spark Issue Type: New Feature Components: EC2, PySpark Affects Versions: 1.0.2 Environment: Ubuntu 12.04 Reporter: Dawson Reid Priority: Minor It would be very useful to be able to specify the EC2 VPC in which the Spark cluster should be created. When creating a Spark cluster on AWS via the spark-ec2 script, there is no way to specify the id of the VPC you would like the cluster to be created in. The script always creates the cluster in the default VPC. In my case I have deleted the default VPC, and the spark-ec2 script errors out with the following:
{code}
Setting up security groups...
Creating security group test-master
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>VPCIdNotSpecified</Code><Message>No default VPC for this user</Message></Error></Errors><RequestID>312a2281-81a1-4d3c-ba10-0593a886779d</RequestID></Response>
Traceback (most recent call last):
  File "./spark_ec2.py", line 860, in <module>
    main()
  File "./spark_ec2.py", line 852, in main
    real_main()
  File "./spark_ec2.py", line 735, in real_main
    conn, opts, cluster_name)
  File "./spark_ec2.py", line 247, in launch_cluster
    master_group = get_or_make_group(conn, cluster_name + "-master")
  File "./spark_ec2.py", line 143, in get_or_make_group
    return conn.create_security_group(name, "Spark EC2 group")
  File "/home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 2011, in create_security_group
  File "/home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", line 925, in get_object
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>VPCIdNotSpecified</Code><Message>No default VPC for this user</Message></Error></Errors><RequestID>312a2281-81a1-4d3c-ba10-0593a886779d</RequestID></Response>
{code}
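One way the requested feature could surface in spark_ec2.py is an explicit command-line option that is threaded through to security-group creation. This is purely a hypothetical sketch: the option name and the plumbing below are assumptions, not the script's actual interface (newer boto releases accept an optional vpc_id keyword on create_security_group, which is what the kwargs here are meant to feed):

```python
from optparse import OptionParser  # spark_ec2.py of this era used Python 2-style option parsing

def parse_args(argv):
    parser = OptionParser(usage="spark-ec2 [options] <action> <cluster_name>")
    parser.add_option("--vpc-id", default=None,
                      help="VPC in which to launch instances "
                           "(default: the account's default VPC)")
    return parser.parse_args(argv)

def make_group_kwargs(opts):
    # Only pass vpc_id along when the user asked for a specific VPC,
    # preserving the current default-VPC behavior otherwise.
    kwargs = {}
    if opts.vpc_id:
        kwargs["vpc_id"] = opts.vpc_id
    return kwargs
```

With such a flag, a user whose default VPC has been deleted could run `spark-ec2 --vpc-id vpc-123... launch test` instead of hitting the VPCIdNotSpecified error.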
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122006#comment-14122006 ] Hari Shreedharan commented on SPARK-3129: - Yes, so my initial goal is to be able to recover all the blocks that have not been made into an RDD yet (at which point it would be safe). There is data which may not have become a block yet (data created using the += operator) - for now, I am going to call it fair game to say that we are going to be adding storeReliably(ArrayBuffer/Iterable) methods, which are the only ones that store data such that it is guaranteed to be recovered. At a later stage, we could use something like a WAL on HDFS to recover even the += data, though that would affect performance. Prevent data loss in Spark Streaming Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Reporter: Hari Shreedharan Assignee: Hari Shreedharan Attachments: StreamingPreventDataLoss.pdf Spark Streaming can lose small amounts of data when the driver goes down and the sending system cannot re-send the data (or the data has already expired on the sender side). The document attached has more details.
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122017#comment-14122017 ] Hari Shreedharan commented on SPARK-3129: - [~tgraves] - Am I correct in assuming that using Akka automatically gives us shared-secret authentication if spark.authenticate is set to true, even if the AM is restarted by YARN itself (since it is the same application, it theoretically has access to the same shared secret and thus should be able to communicate via Akka)?
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122064#comment-14122064 ] Thomas Graves commented on SPARK-3129: -- On YARN, the secret is generated automatically. In cluster mode, this happens in the ApplicationMaster; since the secret is generated there, it goes away when the ApplicationMaster dies. If the secret were generated on the client side and populated into the credentials in the UGI, similar to how we handle tokens, then a restart of the AM in cluster mode should be able to pick it back up. This won't work for client mode, though, since the client/Spark driver wouldn't have a way to get hold of the UGI again.
[jira] [Resolved] (SPARK-3378) Replace the word SparkSQL with right word Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-3378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3378. - Resolution: Fixed Fix Version/s: 1.2.0 Replace the word SparkSQL with right word Spark SQL --- Key: SPARK-3378 URL: https://issues.apache.org/jira/browse/SPARK-3378 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Trivial Fix For: 1.2.0 In programming-guide.md, there are 2 occurrences of SparkSQL. We should use Spark SQL instead.
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122103#comment-14122103 ] Hari Shreedharan commented on SPARK-3129: - I am less worried about client mode, since most streaming applications would run in cluster mode. We can make this available only in cluster mode.
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122114#comment-14122114 ] Hari Shreedharan commented on SPARK-3129: - Looks like simply moving the code that generates the secret and sets it in the UGI to the Client class should take care of that.
[jira] [Updated] (SPARK-3405) EC2 cluster creation on VPC
[ https://issues.apache.org/jira/browse/SPARK-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3405: --- Component/s: (was: PySpark) EC2 cluster creation on VPC --- Key: SPARK-3405 URL: https://issues.apache.org/jira/browse/SPARK-3405 Project: Spark Issue Type: New Feature Components: EC2 Affects Versions: 1.0.2 Environment: Ubuntu 12.04 Reporter: Dawson Reid Priority: Minor
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122219#comment-14122219 ] Xiangrui Meng commented on SPARK-3403: -- I don't have a Windows system to test on. There should be a runtime flag you can set to control the number of threads OpenBLAS uses. Could you try that? I will test the attached code on OS X and report back. NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) - Key: SPARK-3403 URL: https://issues.apache.org/jira/browse/SPARK-3403 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Environment: Setup: Windows 7, x64 libraries for netlib-java (as described on https://github.com/fommil/netlib-java). I used OpenBLAS x64 and MinGW64 precompiled DLLs. Reporter: Alexander Ulanov Fix For: 1.1.0 Attachments: NativeNN.scala Code:
{code}
val model = NaiveBayes.train(train)
val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
predictionAndLabels.foreach(println)
{code}
Result: the program crashes with "Process finished with exit code -1073741819 (0xC0000005)" after displaying the first prediction
[jira] [Created] (SPARK-3406) Python persist API does not have a default storage level
holdenk created SPARK-3406: -- Summary: Python persist API does not have a default storage level Key: SPARK-3406 URL: https://issues.apache.org/jira/browse/SPARK-3406 Project: Spark Issue Type: Bug Components: PySpark Reporter: holdenk Priority: Minor PySpark's persist method on RDDs does not have a default storage level. This differs from the Scala API, which defaults to in-memory caching. This is minor.
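The requested behavior can be sketched as follows. This is a simplified illustration, not PySpark source: `StorageLevel` and `RDD` below are minimal stand-ins, and the choice of `MEMORY_ONLY` as the default mirrors what the Scala API does.

```python
# Sketch of the proposal: persist() with no argument falls back to an
# in-memory level, matching the Scala API. These classes are simplified
# stand-ins for pyspark.StorageLevel and pyspark.RDD.
class StorageLevel:
    MEMORY_ONLY = "MEMORY_ONLY"

class RDD:
    def __init__(self):
        self.storage_level = None

    def persist(self, storageLevel=StorageLevel.MEMORY_ONLY):
        # Defaulting the parameter means rdd.persist() behaves like cache().
        self.storage_level = storageLevel
        return self
```

With the default in place, `rdd.persist()` and `rdd.persist(StorageLevel.MEMORY_ONLY)` become equivalent.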
[jira] [Updated] (SPARK-3390) sqlContext.jsonRDD fails on a complex structure of JSON array and JSON object nesting
[ https://issues.apache.org/jira/browse/SPARK-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-3390: Summary: sqlContext.jsonRDD fails on a complex structure of JSON array and JSON object nesting (was: sqlContext.jsonRDD fails on a complex structure of array and hashmap nesting) sqlContext.jsonRDD fails on a complex structure of JSON array and JSON object nesting - Key: SPARK-3390 URL: https://issues.apache.org/jira/browse/SPARK-3390 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Vida Ha Assignee: Yin Huai Priority: Critical I found a valid JSON string which Spark SQL fails to parse correctly. Try running these lines in a spark-shell to reproduce:
{code:borderStyle=solid}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val badJson = "{\"foo\": [[{\"bar\": 0}]]}"
val rdd = sc.parallelize(badJson :: Nil)
sqlContext.jsonRDD(rdd).count()
{code}
I've tried running these lines on the 1.0.2 release as well as the latest Spark 1.1 release candidate, and I get this stack trace:
{panel}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0:3 failed 1 times, most recent failure: Exception failure in TID 7 on host localhost: scala.MatchError: StructType(List()) (of class org.apache.spark.sql.catalyst.types.StructType)
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:333)
org.apache.spark.sql.json.JsonRDD$$anonfun$enforceCorrectType$1.apply(JsonRDD.scala:335)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.AbstractTraversable.map(Traversable.scala:105)
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:335)
org.apache.spark.sql.json.JsonRDD$$anonfun$enforceCorrectType$1.apply(JsonRDD.scala:335)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.AbstractTraversable.map(Traversable.scala:105)
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:335)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:365)
scala.Option.map(Option.scala:145)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:364)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:349)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:349)
org.apache.spark.sql.json.JsonRDD$$anonfun$createLogicalPlan$1.apply(JsonRDD.scala:51)
org.apache.spark.sql.json.JsonRDD$$anonfun$createLogicalPlan$1.apply(JsonRDD.scala:51)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
{panel}
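To confirm the reporter's claim that the input itself is valid JSON, a standard JSON parser accepts it without complaint; the failure is specific to Spark SQL's schema inference for an object nested inside a doubly-nested array:

```python
import json

# The reported input, an object inside a doubly-nested array. A standard
# JSON parser accepts it, so the string is valid; the MatchError above
# comes from Spark SQL's type enforcement, not from malformed input.
bad_json = '{"foo": [[{"bar": 0}]]}'
parsed = json.loads(bad_json)
assert parsed == {"foo": [[{"bar": 0}]]}
```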
[jira] [Commented] (SPARK-3390) sqlContext.jsonRDD fails on a complex structure of array and hashmap nesting
[ https://issues.apache.org/jira/browse/SPARK-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122235#comment-14122235 ] Yin Huai commented on SPARK-3390: - Oh, I see the problem. I am out of town this week. Will fix it next week.
[jira] [Commented] (SPARK-2430) Standarized Clustering Algorithm API and Framework
[ https://issues.apache.org/jira/browse/SPARK-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122242#comment-14122242 ] Yu Ishikawa commented on SPARK-2430: Hi [~rnowling], I am very interested in this issue. If possible, I am willing to work with you on it. I think MLlib's high-level API should be consistent, like scikit-learn's. You know, we can use almost all algorithms in scikit-learn through `fit` and `predict` functions. A consistent API would be helpful for Spark users too. Standarized Clustering Algorithm API and Framework -- Key: SPARK-2430 URL: https://issues.apache.org/jira/browse/SPARK-2430 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Priority: Minor Recently, there has been a chorus of voices on the mailing lists about adding new clustering algorithms to MLlib. To support these additions, we should develop a common framework and API to reduce code duplication and keep the APIs consistent. At the same time, we can also expand the current API to incorporate requested features such as arbitrary distance metrics or pre-computed distance matrices.
[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122249#comment-14122249 ] Yu Ishikawa commented on SPARK-2966: I'm sorry for not checking the community discussion and JIRA issues. Thank you for letting me know. We would be able to implement an approximation algorithm for hierarchical clustering with LSH. I think the approach of this issue is different from that of [SPARK-2429]. Should we merge this issue into [SPARK-2429]? Add an approximation algorithm for hierarchical clustering to MLlib --- Key: SPARK-2966 URL: https://issues.apache.org/jira/browse/SPARK-2966 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yu Ishikawa Priority: Minor A hierarchical clustering algorithm is a useful unsupervised learning method. Koga et al. proposed a highly scalable hierarchical clustering algorithm in (1), and I would like to implement this method. I suggest adding an approximate hierarchical clustering algorithm to MLlib. I'd like this to be assigned to me. h3. Reference # Fast agglomerative hierarchical clustering algorithm using Locality-Sensitive Hashing http://dl.acm.org/citation.cfm?id=1266811
[jira] [Resolved] (SPARK-3310) Directly use currentTable without unnecessary implicit conversion
[ https://issues.apache.org/jira/browse/SPARK-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3310. - Resolution: Fixed Fix Version/s: 1.2.0 Directly use currentTable without unnecessary implicit conversion - Key: SPARK-3310 URL: https://issues.apache.org/jira/browse/SPARK-3310 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Priority: Minor Fix For: 1.2.0 We can directly use currentTable in the cacheTable function without unnecessary implicit conversion.
[jira] [Resolved] (SPARK-2219) AddJar doesn't work
[ https://issues.apache.org/jira/browse/SPARK-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2219. - Resolution: Fixed Fix Version/s: 1.2.0 AddJar doesn't work --- Key: SPARK-2219 URL: https://issues.apache.org/jira/browse/SPARK-2219 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Fix For: 1.2.0
[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122266#comment-14122266 ] RJ Nowling commented on SPARK-2966: --- No worries. Based on my reading of the Spark contribution guidelines ( https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark ), I think that the Spark community would prefer to have one good implementation of an algorithm instead of multiple similar algorithms. Since the community has stated a clear preference for divisive hierarchical clustering, I think that is a better aim. You seem very motivated and have made some good contributions -- would you like to take the lead on the hierarchical clustering? I can review your code to help you improve it. That said, I suggest you look at the comment I added to SPARK-2429 and see what you think of that approach. If you like the example code and papers, why don't you work on implementing it efficiently in Spark?
[jira] [Commented] (SPARK-2430) Standarized Clustering Algorithm API and Framework
[ https://issues.apache.org/jira/browse/SPARK-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122273#comment-14122273 ] RJ Nowling commented on SPARK-2430: --- Hi Yu, The community had suggested looking into scikit-learn's API so that is a good idea. I am hesitant to make backwards-incompatible API changes, however, until we know the new API will be stable for a long time. I think it would be best to implement a few more clustering algorithms to get a clear idea of what is similar vs different before making a new API. May I suggest you work on SPARK-2966 / SPARK-2429 first? RJ
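The scikit-learn-style consistency discussed in this thread can be sketched as a common base interface that every clustering algorithm implements. This is purely illustrative (not MLlib code); the class and method names are hypothetical:

```python
# Illustrative sketch of a consistent fit/predict surface for clustering
# algorithms, in the spirit of scikit-learn's estimator API. Callers can
# swap implementations without changing their code.
class ClusteringModel:
    def fit(self, data):
        raise NotImplementedError

    def predict(self, point):
        raise NotImplementedError

class NearestCenterModel(ClusteringModel):
    """Toy example: assigns each 1-D point to the nearest of two centers."""
    def fit(self, data):
        # A real implementation would learn centers from the data;
        # here we just take the extremes as two "clusters".
        self.centers = [min(data), max(data)]
        return self

    def predict(self, point):
        return min(range(len(self.centers)),
                   key=lambda i: abs(point - self.centers[i]))
```

Any new algorithm added behind the same interface inherits the uniform call pattern `model = Algo().fit(data); model.predict(x)`.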
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122274#comment-14122274 ] Saisai Shao commented on SPARK-3129: Hi [~hshreedharan], thanks for your reply. Is this PR (https://github.com/apache/spark/pull/1195) the one you mentioned about storeReliably()? To my knowledge, this API aims to store a bunch of messages into BM directly to make them reliable, but for some receivers (Kafka, socket and others) data is injected one message at a time; we can't call storeReliably() for each message because of efficiency and throughput concerns, so we need to buffer the data locally up to some amount and then flush it to BM using storeReliably(). So I think data can still be lost while it is stored locally. These days I have been thinking about the WAL approach; IMHO a WAL would be a better solution than the blocked store API.
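The batching concern described in the comment above can be sketched as follows. `storeReliably` comes from the thread; the `BufferedReliableStore` wrapper and its parameters are hypothetical. The sketch makes the risk window explicit: whatever sits in the local buffer at crash time has not yet been stored reliably.

```python
# Hypothetical sketch of the receiver-side buffering Saisai describes:
# messages arrive one at a time, are buffered locally, and are flushed
# in batches through a reliable store call (e.g. storeReliably).
class BufferedReliableStore:
    def __init__(self, store_reliably, batch_size=100):
        self.store_reliably = store_reliably  # batch-oriented reliable sink
        self.batch_size = batch_size
        self.buffer = []  # messages here are the ones at risk on a crash

    def store(self, message):
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.store_reliably(list(self.buffer))
            self.buffer.clear()
```

A WAL, by contrast, would persist each message before (or as) it enters the buffer, closing this window at the cost of extra writes.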
[jira] [Created] (SPARK-3407) Add Date type support
Cheng Hao created SPARK-3407: Summary: Add Date type support Key: SPARK-3407 URL: https://issues.apache.org/jira/browse/SPARK-3407 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao
[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122292#comment-14122292 ] Saisai Shao commented on SPARK-2926: Hi Matei, sorry for the late response. I will test more scenarios with your notes, and also factor the code out to see if some of it can be shared with ExternalSorter. Thanks a lot. Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle -- Key: SPARK-2926 URL: https://issues.apache.org/jira/browse/SPARK-2926 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.1.0 Reporter: Saisai Shao Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test Report.pdf Currently Spark has already integrated sort-based shuffle write, which greatly improves IO performance and reduces memory consumption when the reducer number is very large. But the reducer side still adopts a hash-based shuffle reader, which neglects the ordering of map output data in some situations. Here we propose an MR-style sort-merge shuffle reader for sort-based shuffle, to better improve the performance of sort-based shuffle. Work-in-progress code and a performance test report will be posted later, once some unit test bugs are fixed. Any comments would be greatly appreciated. Thanks a lot.
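The core idea of the proposed MR-style reader is an n-way merge of map outputs that are already sorted by key, rather than re-aggregating them via hashing. The principle can be illustrated (with made-up map-output data) by Python's `heapq.merge`:

```python
import heapq

# Two map outputs, each already sorted by key, as sort-based shuffle
# write produces. The illustrative data is made up.
map_output_1 = [(1, "a"), (3, "c"), (5, "e")]
map_output_2 = [(2, "b"), (4, "d")]

# heapq.merge performs the n-way merge: keys arrive at the reducer in
# sorted order without sorting the combined stream from scratch.
merged = list(heapq.merge(map_output_1, map_output_2))
```

This is what makes the reader "merge-sort" style: each input stream is consumed once, and the merge itself is linear in the total number of records.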
[jira] [Resolved] (SPARK-3392) Set command always get undefined for key mapred.reduce.tasks
[ https://issues.apache.org/jira/browse/SPARK-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3392. - Resolution: Fixed Fix Version/s: 1.2.0 Set command always get undefined for key mapred.reduce.tasks Key: SPARK-3392 URL: https://issues.apache.org/jira/browse/SPARK-3392 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Trivial Fix For: 1.2.0 This is a tiny fix for getting the value of mapred.reduce.tasks, which makes more sense for Hive users.
[jira] [Created] (SPARK-3408) Limit operator doesn't work with sort based shuffle
Reynold Xin created SPARK-3408: -- Summary: Limit operator doesn't work with sort based shuffle Key: SPARK-3408 URL: https://issues.apache.org/jira/browse/SPARK-3408 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Reynold Xin Assignee: Reynold Xin
[jira] [Created] (SPARK-3409) Avoid pulling in Exchange operator itself in Exchange's closures
Reynold Xin created SPARK-3409: -- Summary: Avoid pulling in Exchange operator itself in Exchange's closures Key: SPARK-3409 URL: https://issues.apache.org/jira/browse/SPARK-3409 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin
{code}
val rdd = child.execute().mapPartitions { iter =>
  if (sortBasedShuffleOn) {
    iter.map(r => (null, r.copy()))
  } else {
    val mutablePair = new MutablePair[Null, Row]()
    iter.map(r => mutablePair.update(null, r))
  }
}
{code}
The above snippet from Exchange references sortBasedShuffleOn within a closure, which requires pulling in the entire Exchange object in the closure. This is a tiny teeny optimization.
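The closure-capture problem described above has a direct Python analog, shown here as an illustrative sketch (the class below is a toy, not Spark code): reading a field through `self` inside a closure captures the whole object, while copying the field to a local variable first captures only the value.

```python
# Python analog of the SPARK-3409 optimization: avoid dragging the whole
# operator object into a closure by copying the needed field to a local.
class Exchange:
    def __init__(self):
        self.sort_based_shuffle_on = True
        self.big_payload = [0] * 100_000  # shipped too if `self` is captured

    def closure_capturing_self(self):
        # Referencing self.sort_based_shuffle_on closes over the whole
        # Exchange instance, big_payload included.
        return lambda: self.sort_based_shuffle_on

    def closure_capturing_value(self):
        sort_on = self.sort_based_shuffle_on  # local copy of just the flag
        # This closure closes over a single bool, not the Exchange.
        return lambda: sort_on
```

In Spark, the same fix is typically written as `val sortOn = sortBasedShuffleOn` before the `mapPartitions` closure, so only the boolean is serialized to executors.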
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122386#comment-14122386 ] sam commented on SPARK-1473: Good paper, the theory is very solid. My only concern is that the paper does not explicitly tackle the problem of probability estimation for high dimensionality, which for sparse data will be even worse. It just touches on the problem, saying: This in turn causes increasingly poor judgements for the in- clusion/exclusion of features. For precisely this reason, the research community have developed various low-dimensional approximations to (9). In the following sections, we will investigate the implicit statistical assumptions and empirical effects of these approximations Those mentioned sections do not go into theoretical detail, and therefore I disagree that the paper provides a single unified information theoretic framework for feature selection as it basically leaves the problem of probability estimation to the readers choice, and merely suggests the reader assumes some level of independence between features in order to implement an algorithm. [~dmborque] Do you know of any literature that does approach the problem of probability estimation in an information theoretic and philosophically justified way?? Anyway despite my concerns, this paper is still by far the best treatment of feature selection I have seen. 
Feature selection for high dimensional datasets --- Key: SPARK-1473 URL: https://issues.apache.org/jira/browse/SPARK-1473 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ignacio Zendejas Assignee: Alexander Ulanov Priority: Minor Labels: features For classification tasks involving large feature spaces on the order of tens of thousands of features or more (e.g., text classification with n-grams, where n > 1), it is often useful to rank and filter out irrelevant features, thereby reducing the feature space by at least one or two orders of magnitude without impacting performance on key evaluation metrics (accuracy/precision/recall). A flexible feature evaluation interface needs to be designed, and at least two methods should be implemented, with Information Gain being a priority as it has been shown to be among the most reliable. Special consideration should be given in the design to wrapper methods (see the research papers below), which are more practical for lower dimensional data. Relevant research: * Brown, G., Pocock, A., Zhao, M. J., Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. *The Journal of Machine Learning Research*, *13*, 27-66. * Forman, George. (2003). An extensive empirical study of feature selection metrics for text classification. *The Journal of Machine Learning Research*, *3*, 1289-1305.
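As a hedged illustration of the Information Gain criterion named above (a from-scratch sketch with illustrative names, not MLlib's actual interface): IG(Y; X) = H(Y) - H(Y | X), i.e. a feature is ranked by how much knowing it reduces the entropy of the labels.

```scala
// Information Gain for discrete features: IG(Y;X) = H(Y) - H(Y|X).
// Names are illustrative; this is a sketch, not MLlib's API.

// Shannon entropy (in bits) of a label sequence.
def entropy(labels: Seq[Int]): Double = {
  val n = labels.size.toDouble
  labels.groupBy(identity).values.map { group =>
    val p = group.size / n
    -p * math.log(p) / math.log(2)
  }.sum
}

// H(Y) minus the weighted entropy of the labels within each feature value.
def infoGain(feature: Seq[Int], labels: Seq[Int]): Double = {
  val n = labels.size.toDouble
  val conditional = feature.zip(labels).groupBy(_._1).values.map { pairs =>
    (pairs.size / n) * entropy(pairs.map(_._2))
  }.sum
  entropy(labels) - conditional
}
```

On balanced binary labels, a perfectly predictive feature scores 1 bit and an irrelevant one scores 0, so sorting features by `infoGain` and keeping the top k implements the filter-style reduction the issue describes.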
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122390#comment-14122390 ] sam commented on SPARK-1473: [~dmm...@gmail.com] Mentioning you as well (I can't work out which David is the one who posted above). Feature selection for high dimensional datasets --- Key: SPARK-1473 URL: https://issues.apache.org/jira/browse/SPARK-1473 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ignacio Zendejas Assignee: Alexander Ulanov Priority: Minor Labels: features For classification tasks involving large feature spaces on the order of tens of thousands of features or more (e.g., text classification with n-grams, where n > 1), it is often useful to rank and filter out irrelevant features, thereby reducing the feature space by at least one or two orders of magnitude without impacting performance on key evaluation metrics (accuracy/precision/recall). A flexible feature evaluation interface needs to be designed, and at least two methods should be implemented, with Information Gain being a priority as it has been shown to be among the most reliable. Special consideration should be given in the design to wrapper methods (see the research papers below), which are more practical for lower dimensional data. Relevant research: * Brown, G., Pocock, A., Zhao, M. J., Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. *The Journal of Machine Learning Research*, *13*, 27-66. * Forman, George. (2003). An extensive empirical study of feature selection metrics for text classification. *The Journal of Machine Learning Research*, *3*, 1289-1305.
[jira] [Commented] (SPARK-3408) Limit operator doesn't work with sort based shuffle
[ https://issues.apache.org/jira/browse/SPARK-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122407#comment-14122407 ] Apache Spark commented on SPARK-3408: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2281 Limit operator doesn't work with sort based shuffle --- Key: SPARK-3408 URL: https://issues.apache.org/jira/browse/SPARK-3408 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Reynold Xin Assignee: Reynold Xin
[jira] [Commented] (SPARK-3409) Avoid pulling in Exchange operator itself in Exchange's closures
[ https://issues.apache.org/jira/browse/SPARK-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122408#comment-14122408 ] Apache Spark commented on SPARK-3409: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2282 Avoid pulling in Exchange operator itself in Exchange's closures Key: SPARK-3409 URL: https://issues.apache.org/jira/browse/SPARK-3409 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin
{code}
val rdd = child.execute().mapPartitions { iter =>
  if (sortBasedShuffleOn) {
    iter.map(r => (null, r.copy()))
  } else {
    val mutablePair = new MutablePair[Null, Row]()
    iter.map(r => mutablePair.update(null, r))
  }
}
{code}
The above snippet from Exchange references sortBasedShuffleOn within a closure, which requires pulling the entire Exchange object into the closure. This is a teeny-tiny optimization.
[jira] [Updated] (SPARK-3410) The priority of the shutdown hook for ApplicationMaster should not be an integer literal but should refer to a constant.
[ https://issues.apache.org/jira/browse/SPARK-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3410: -- Issue Type: Improvement (was: Bug) The priority of the shutdown hook for ApplicationMaster should not be an integer literal but should refer to a constant. - Key: SPARK-3410 URL: https://issues.apache.org/jira/browse/SPARK-3410 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Minor In ApplicationMaster, the priority of the shutdown hook is set to the literal 30, which is expected to be higher than the priority of o.a.h.FileSystem's shutdown hook. In FileSystem, that priority is exposed as a public constant named SHUTDOWN_HOOK_PRIORITY, so it would be better to derive the priority of ApplicationMaster's shutdown hook from this constant.
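A toy model of why the ordering matters (the class below is hypothetical; Hadoop's real one is o.a.h.util.ShutdownHookManager): hooks run in decreasing priority order, so the ApplicationMaster hook must outrank FileSystem's, and deriving its priority from the constant keeps that invariant explicit instead of hiding it in a magic 30.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for Hadoop's ShutdownHookManager: hooks run in decreasing
// priority order. This models only the ordering behaviour, not the real class.
class MiniShutdownHookManager {
  private val hooks = ArrayBuffer.empty[(Int, () => Unit)]
  def addShutdownHook(priority: Int)(body: => Unit): Unit =
    hooks += ((priority, () => body))
  def runAll(): Unit =
    hooks.sortBy { case (p, _) => -p }.foreach { case (_, hook) => hook() }
}

// Mirrors o.a.h.fs.FileSystem.SHUTDOWN_HOOK_PRIORITY; referring to the constant
// (rather than the literal 30) makes the "must run before FileSystem" intent clear.
val FS_SHUTDOWN_HOOK_PRIORITY = 10

val order = ArrayBuffer.empty[String]
val mgr = new MiniShutdownHookManager
mgr.addShutdownHook(FS_SHUTDOWN_HOOK_PRIORITY) { order += "fileSystem" }
mgr.addShutdownHook(FS_SHUTDOWN_HOOK_PRIORITY + 20) { order += "applicationMaster" }
mgr.runAll()
```

With the higher priority expressed as an offset from the constant, the ApplicationMaster hook fires first even if the base value ever changes.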
[jira] [Commented] (SPARK-3410) The priority of the shutdown hook for ApplicationMaster should not be an integer literal but should refer to a constant.
[ https://issues.apache.org/jira/browse/SPARK-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122465#comment-14122465 ] Apache Spark commented on SPARK-3410: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2283 The priority of the shutdown hook for ApplicationMaster should not be an integer literal but should refer to a constant. - Key: SPARK-3410 URL: https://issues.apache.org/jira/browse/SPARK-3410 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Minor In ApplicationMaster, the priority of the shutdown hook is set to the literal 30, which is expected to be higher than the priority of o.a.h.FileSystem's shutdown hook. In FileSystem, that priority is exposed as a public constant named SHUTDOWN_HOOK_PRIORITY, so it would be better to derive the priority of ApplicationMaster's shutdown hook from this constant.
[jira] [Created] (SPARK-3411) Optimize the schedule procedure in Master
WangTaoTheTonic created SPARK-3411: -- Summary: Optimize the schedule procedure in Master Key: SPARK-3411 URL: https://issues.apache.org/jira/browse/SPARK-3411 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Priority: Minor If the array of waiting drivers is large, the drivers in it will all be dispatched to the first worker we get (if it has enough resources), with or without randomization. We should randomize every time we dispatch a driver, in order to balance drivers across workers better.
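A minimal sketch of the proposed change (illustrative types, not Master's actual bookkeeping): shuffle the worker list for each waiting driver rather than once for the whole batch, so consecutive drivers do not all land on the same worker.

```scala
import scala.util.Random

// Illustrative stand-ins for Master's driver/worker state.
final case class DriverReq(id: Int, cores: Int)
final class WorkerInfo(val id: Int, var freeCores: Int)

// Re-shuffle per driver (the issue's proposal), then take the first worker
// with enough free resources. Returns a map of driverId -> workerId for the
// drivers that could be placed.
def dispatchDrivers(drivers: Seq[DriverReq],
                    workers: Seq[WorkerInfo],
                    rand: Random): Map[Int, Int] =
  drivers.flatMap { d =>
    rand.shuffle(workers).find(_.freeCores >= d.cores).map { w =>
      w.freeCores -= d.cores // reserve the resources on the chosen worker
      d.id -> w.id
    }
  }.toMap
```

With a single up-front shuffle, a long waiting queue drains onto the first worker until it is full; shuffling per driver spreads the load across all workers with capacity.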