[jira] [Commented] (SPARK-5685) Show warning when users open text files compressed with non-splittable algorithms like gzip
[ https://issues.apache.org/jira/browse/SPARK-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311931#comment-14311931 ] Nicholas Chammas commented on SPARK-5685: - [~joshrosen] - What do you think of adding a warning like this? Show warning when users open text files compressed with non-splittable algorithms like gzip --- Key: SPARK-5685 URL: https://issues.apache.org/jira/browse/SPARK-5685 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Nicholas Chammas Priority: Minor This is a usability or user-friendliness issue. It's extremely common for people to load a text file compressed with gzip, process it, and then wonder why only 1 core in their cluster is doing any work. Some examples: * http://stackoverflow.com/q/28127119/877069 * http://stackoverflow.com/q/27531816/877069 I'm not sure how this problem can be generalized, but at the very least it would be helpful if Spark displayed some kind of warning in the common case when someone opens a gzipped file with {{sc.textFile}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
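A minimal sketch of the symptom SPARK-5685 describes, and of where such a warning could hook in. The file path is hypothetical, and the check at the end illustrates the proposed warning, not anything Spark currently does:
{code}
import org.apache.spark.{SparkConf, SparkContext}

object GzipPartitionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("GzipPartitionDemo").setMaster("local[4]"))
    val path = "data/events.log.gz" // hypothetical input file
    // gzip is not splittable, so the requested minPartitions is ignored and
    // the whole file becomes a single partition (hence a single task / core).
    val rdd = sc.textFile(path, minPartitions = 8)
    println(s"partitions = ${rdd.partitions.length}") // prints 1
    // The warning proposed in this ticket could be as simple as:
    if (path.endsWith(".gz")) {
      Console.err.println(
        "WARN: gzip is not splittable; this file will be read by a single task.")
    }
    sc.stop()
  }
}
{code}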
[jira] [Updated] (SPARK-5682) Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated SPARK-5682: Attachment: encrypted_shuffle.patch.4 encrypted_shuffle.patch.4 shows how to reuse the Hadoop encrypted-shuffle classes to enable Spark encrypted shuffle. How to apply: patch -p1 < encrypted_shuffle.patch.4 Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, encrypted_shuffle.patch.4 Encrypted shuffle was introduced in Hadoop 2.6 and makes the transfer of shuffle data safer. This feature is also needed in Spark. We reuse the Hadoop encrypted shuffle feature in Spark; because UGI credential info is required for encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5681) Calling graceful stop() immediately after start() on StreamingContext should not get stuck indefinitely
[ https://issues.apache.org/jira/browse/SPARK-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5681: - Component/s: Streaming Calling graceful stop() immediately after start() on StreamingContext should not get stuck indefinitely --- Key: SPARK-5681 URL: https://issues.apache.org/jira/browse/SPARK-5681 Project: Spark Issue Type: Bug Components: Streaming Reporter: Liang-Chi Hsieh Sometimes the receiver is registered with the tracker only after ssc.stop() has been called, especially when stop() is called immediately after start(). The receiver then never gets the StopReceiver message from the tracker, so a stop() in graceful mode gets stuck indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
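A minimal repro sketch of the race described above, assuming a local socket source on port 9999; in the bad case the receiver registers only after stop() is already waiting:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulStopRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GracefulStopRepro").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    // A receiver-based stream; the receiver registers with the tracker asynchronously.
    ssc.socketTextStream("localhost", 9999).print()
    ssc.start()
    // If the receiver registers with the tracker *after* this call, it never
    // receives StopReceiver, and the graceful stop below blocks indefinitely.
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
{code}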
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311956#comment-14311956 ] Manoj Kumar commented on SPARK-5016: [~tgaloppo] I would like your inputs on this as well. GaussianMixtureEM should distribute matrix inverse for large numFeatures, k --- Key: SPARK-5016 URL: https://issues.apache.org/jira/browse/SPARK-5016 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley If numFeatures or k are large, GMM EM should distribute the matrix inverse computation for Gaussian initialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5684) Key not found exception is thrown in case location of added partition to a parquet table is different than a path containing the partition values
[ https://issues.apache.org/jira/browse/SPARK-5684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311943#comment-14311943 ] Apache Spark commented on SPARK-5684: - User 'saucam' has created a pull request for this issue: https://github.com/apache/spark/pull/4469 Key not found exception is thrown in case location of added partition to a parquet table is different than a path containing the partition values - Key: SPARK-5684 URL: https://issues.apache.org/jira/browse/SPARK-5684 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0 Reporter: Yash Datta Fix For: 1.3.0 Create a partitioned parquet table: create table test_table (dummy string) partitioned by (timestamp bigint) stored as parquet; Add a partition to the table and specify a different location: alter table test_table add partition (timestamp=9) location '/data/pth/different' Run a simple select * query and we get an exception: 15/02/09 08:27:25 ERROR thriftserver.SparkSQLDriver: Failed in [select * from db4_mi2mi_binsrc1_default limit 5] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 21, localhost): java.util.NoSuchElementException: key not found: timestamp at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:58) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:141) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:128) at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:247) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) This happens because in the parquet path it is assumed that (key=value) patterns are present in the partition location, which is not always the case! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
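An illustrative sketch of the faulty assumption (not the actual ParquetTableOperations code): partition values are recovered by scanning the location path for key=value segments, which yields nothing for a custom location, so the later lookup of "timestamp" throws:
{code}
// Illustrative only: recover partition values from key=value path segments.
def partitionValuesFromPath(path: String): Map[String, String] =
  path.split("/").filter(_.contains("=")).map { seg =>
    val Array(k, v) = seg.split("=", 2)
    k -> v
  }.toMap

partitionValuesFromPath("/warehouse/test_table/timestamp=9") // Map(timestamp -> 9)
partitionValuesFromPath("/data/pth/different") // Map() -- the "timestamp" lookup later fails
{code}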
[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311948#comment-14311948 ] irene rognoni commented on SPARK-5281: -- same issue here, since last week, no news on this? Registering table on RDD is giving MissingRequirementError -- Key: SPARK-5281 URL: https://issues.apache.org/jira/browse/SPARK-5281 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: sarsol Priority: Critical Application crashes on this line rdd.registerTempTable("temp") in 1.2 version when using sbt or Eclipse SCALA IDE Stacktrace Exception in thread "main" scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program Files\Java\jre7\lib\resources.jar;C:\Program Files\Java\jre7\lib\rt.jar;C:\Program Files\Java\jre7\lib\sunrsasign.jar;C:\Program Files\Java\jre7\lib\jsse.jar;C:\Program Files\Java\jre7\lib\jce.jar;C:\Program Files\Java\jre7\lib\charsets.jar;C:\Program Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found. at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) at scala.reflect.api.Universe.typeOf(Universe.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94) at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33) at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) at com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43) at scala.Function0$class.apply$mcV$sp(Function0.scala:40) at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) at scala.App$$anonfun$main$1.apply(App.scala:71) at
scala.App$$anonfun$main$1.apply(App.scala:71) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) at scala.App$class.main(App.scala:71) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo
[ https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311990#comment-14311990 ] Sean Owen commented on SPARK-5676: -- Dumb question, I know, but what is the spark-ec2 repo you guys refer to? This one? https://github.com/mesos/spark-ec2 The only repo for Spark is the main spark repo and it has all this license info buttoned up. Other repos are not part of Spark. License missing from spark-ec2 repo --- Key: SPARK-5676 URL: https://issues.apache.org/jira/browse/SPARK-5676 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein There is no LICENSE file or licence headers in the code in the spark-ec2 repo. Also, I believe there is no contributor license agreement notification in place (like there is in the main spark repo). It would be great to fix this (sooner better than later while contributors list is small), so that users wishing to use this part of Spark are not in doubt over licensing issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5679) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method
[ https://issues.apache.org/jira/browse/SPARK-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311994#comment-14311994 ] Sean Owen commented on SPARK-5679: -- Same as SPARK-5227? Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method -- Key: SPARK-5679 URL: https://issues.apache.org/jira/browse/SPARK-5679 Project: Spark Issue Type: Bug Components: Spark Core, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Kostas Sakellis Priority: Blocker Please audit these and see if there are any assumptions with respect to File IO that might not hold in all cases. I'm happy to help if you can't find anything. These both failed in the same run: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-SBT/38/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink {code} org.apache.spark.metrics.InputOutputMetricsSuite.input metrics with mixed read method Failing for the past 13 builds (Since Failed#26 ) Took 48 sec. Error Message 2030 did not equal 6496 Stacktrace sbt.ForkMain$ForkError: 2030 did not equal 6496 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply$mcV$sp(InputOutputMetricsSuite.scala:135) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfter$$super$runTest(InputOutputMetricsSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.metrics.InputOutputMetricsSuite.runTest(InputOutputMetricsSuite.scala:46) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at 
org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfterAll$$super$run(InputOutputMetricsSuite.scala:46) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) at
[jira] [Resolved] (SPARK-5473) Expose SSH failures after status checks pass
[ https://issues.apache.org/jira/browse/SPARK-5473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5473. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4262 [https://github.com/apache/spark/pull/4262] Expose SSH failures after status checks pass Key: SPARK-5473 URL: https://issues.apache.org/jira/browse/SPARK-5473 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.2.0 Reporter: Nicholas Chammas Priority: Minor Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5473) Expose SSH failures after status checks pass
[ https://issues.apache.org/jira/browse/SPARK-5473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5473: - Assignee: Nicholas Chammas Expose SSH failures after status checks pass Key: SPARK-5473 URL: https://issues.apache.org/jira/browse/SPARK-5473 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.2.0 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Priority: Minor Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5239) JdbcRDD throws java.lang.AbstractMethodError: oracle.jdbc.driver.xxxxxx.isClosed()Z
[ https://issues.apache.org/jira/browse/SPARK-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312040#comment-14312040 ] Apache Spark commented on SPARK-5239: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4470 JdbcRDD throws java.lang.AbstractMethodError: oracle.jdbc.driver.xx.isClosed()Z - Key: SPARK-5239 URL: https://issues.apache.org/jira/browse/SPARK-5239 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0 Environment: centos6.4 + ojdbc14 Reporter: Gankun Luo Priority: Minor I tried to use JdbcRDD to operate on a table in an Oracle database, but failed. My test code is as follows: {code}
import java.sql.DriverManager
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.SparkConf

object JdbcRDD4Oracle {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("JdbcRDD4Oracle").setMaster("local[2]"))
    val rdd = new JdbcRDD(sc, () => getConnection, getSQL, 12987, 13055, 3,
      r => (r.getObject("HISTORY_ID"), r.getObject("APPROVE_OPINION")))
    println(rdd.collect.toList)
    sc.stop()
  }

  def getConnection() = {
    Class.forName("oracle.jdbc.driver.OracleDriver").newInstance()
    DriverManager.getConnection("jdbc:oracle:thin:@hadoop000:1521/ORCL", "scott", "tiger")
  }

  def getSQL() =
    "select HISTORY_ID, APPROVE_OPINION from CI_APPROVE_HISTORY WHERE HISTORY_ID >= ? AND HISTORY_ID <= ?"
}
{code} Run the example, I get the following exception: {code} 09:56:48,302 [Executor task launch worker-0] ERROR Logging$class : Error in TaskCompletionListener java.lang.AbstractMethodError: oracle.jdbc.driver.OracleResultSetImpl.isClosed()Z at org.apache.spark.rdd.JdbcRDD$$anon$1.close(JdbcRDD.scala:99) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala:71) at org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala:71) at org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:85) at org.apache.spark.TaskContext$$anonfun$markTaskCompleted$1.apply(TaskContext.scala:110) at org.apache.spark.TaskContext$$anonfun$markTaskCompleted$1.apply(TaskContext.scala:108) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContext.markTaskCompleted(TaskContext.scala:108) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:64) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 09:56:48,302 [Executor task launch worker-1] ERROR Logging$class : Error in TaskCompletionListener {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5688) In Decision Trees, choosing a random subset of categories for each split
[ https://issues.apache.org/jira/browse/SPARK-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Denovitzer updated SPARK-5688: --- Labels: categorical decisiontree (was: categorical) In Decision Trees, choosing a random subset of categories for each split Key: SPARK-5688 URL: https://issues.apache.org/jira/browse/SPARK-5688 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Environment: Any Reporter: Eric Denovitzer Labels: categorical, decisiontree Fix For: 1.2.0 The subset of categories chosen to build a split on a categorical variable is not random. The categories for a subset are chosen based on the binary representation of a number from 1 to (2^(number of categories)) - 2 (which excludes the empty and full subsets). In the current implementation, the integers used for the subsets are always 1..numSplits. These should be chosen at random instead of biasing towards the categories with the lower indexes. Another problem is that if numBins/2 is bigger than the number of possible subsets for a given category set, numSplits is still taken to be numBins/2. It should be the min of numBins/2 and (2^(number of categories)) - 2 (otherwise the same subsets might be considered more than once when choosing the splits). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
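A sketch of the two fixes described in SPARK-5688 (selection logic only, not the actual MLlib code): cap the number of splits at 2^k - 2, and draw the subset codes at random rather than always taking the fixed prefix 1..numSplits. Enumerating all codes is only practical for small category counts, which is also when this code path applies:
{code}
import scala.util.Random

// Bit j of `bits` set means category j goes into the left side of the split.
def subsetFromBits(bits: Int, numCategories: Int): Seq[Int] =
  (0 until numCategories).filter(j => ((bits >> j) & 1) == 1)

def randomCategorySubsets(numCategories: Int, maxSplits: Int, rng: Random): Seq[Seq[Int]] = {
  val totalSubsets = (1 << numCategories) - 2 // excludes the empty and full subsets
  val numSplits = math.min(maxSplits, totalSubsets) // fix 2: never exceed 2^k - 2
  // fix 1: a random sample of subset codes instead of the fixed prefix 1..numSplits
  rng.shuffle((1 to totalSubsets).toVector).take(numSplits).map(subsetFromBits(_, numCategories))
}
{code}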
[jira] [Updated] (SPARK-4423) Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior
[ https://issues.apache.org/jira/browse/SPARK-4423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4423: - Assignee: Ilya Ganelin Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior --- Key: SPARK-4423 URL: https://issues.apache.org/jira/browse/SPARK-4423 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Josh Rosen Assignee: Ilya Ganelin {{foreach}} seems to be a common source of confusion for new users: in {{local}} mode, {{foreach}} can be used to update local variables on the driver, but programs that do this will not work properly when executed on clusters, since the {{foreach}} will update per-executor variables (note that this _will_ work correctly for accumulators, but not for other types of mutable objects). Similarly, I've seen users become confused when {{.foreach(println)}} doesn't print to the driver's standard output. At a minimum, we should improve the documentation to warn users against unsafe uses of {{foreach}} that won't work properly when transitioning from local mode to a real cluster. We might also consider changes to local mode so that its behavior more closely matches the cluster modes; this will require some discussion, though, since any change of behavior here would technically be a user-visible backwards-incompatible change (I don't think that we made any explicit guarantees about the current local-mode behavior, but someone might be relying on the current implicit behavior). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
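The confusion is easy to demonstrate; a short example contrasting the unsafe closure-capture pattern with an accumulator (Spark 1.x API, assuming an existing SparkContext {{sc}}):
{code}
val data = sc.parallelize(1 to 100)

// Anti-pattern: appears to work in local mode, silently wrong on a cluster,
// because each executor increments its own deserialized copy of `counter`.
var counter = 0
data.foreach(x => counter += x)
println(counter) // may print 5050 in local mode; typically 0 on a cluster

// Safe alternative: an accumulator is aggregated back to the driver.
val acc = sc.accumulator(0)
data.foreach(x => acc += x)
println(acc.value) // 5050 in both local and cluster modes
{code}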
[jira] [Commented] (SPARK-4655) Split Stage into ShuffleMapStage and ResultStage subclasses
[ https://issues.apache.org/jira/browse/SPARK-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312458#comment-14312458 ] Ilya Ganelin commented on SPARK-4655: - Hi [~joshrosen], I'd be happy to work on this. Thanks! Split Stage into ShuffleMapStage and ResultStage subclasses --- Key: SPARK-4655 URL: https://issues.apache.org/jira/browse/SPARK-4655 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen The scheduler's {{Stage}} class has many fields which are only applicable to result stages or shuffle map stages. As a result, I think that it makes sense to make {{Stage}} into an abstract base class with two subclasses, {{ResultStage}} and {{ShuffleMapStage}}. This would improve the understandability of the DAGScheduler code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
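A sketch of the shape the refactor could take (field names here are illustrative, not the actual DAGScheduler members):
{code}
import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD

abstract class Stage(val id: Int, val rdd: RDD[_], val numTasks: Int)

// Map-output bookkeeping is only meaningful for shuffle map stages.
class ShuffleMapStage(id: Int, rdd: RDD[_], numTasks: Int,
    val shuffleDep: ShuffleDependency[_, _, _]) extends Stage(id, rdd, numTasks) {
  var numAvailableOutputs: Int = 0
}

// Result-specific state (e.g. the active job producing the result) lives only here.
class ResultStage(id: Int, rdd: RDD[_], numTasks: Int,
    val resultJobId: Option[Int]) extends Stage(id, rdd, numTasks)
{code}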
[jira] [Commented] (SPARK-4423) Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior
[ https://issues.apache.org/jira/browse/SPARK-4423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312424#comment-14312424 ] Ilya Ganelin commented on SPARK-4423: - I'll be happy to update this. Thank you. Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior --- Key: SPARK-4423 URL: https://issues.apache.org/jira/browse/SPARK-4423 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Josh Rosen {{foreach}} seems to be a common source of confusion for new users: in {{local}} mode, {{foreach}} can be used to update local variables on the driver, but programs that do this will not work properly when executed on clusters, since the {{foreach}} will update per-executor variables (note that this _will_ work correctly for accumulators, but not for other types of mutable objects). Similarly, I've seen users become confused when {{.foreach(println)}} doesn't print to the driver's standard output. At a minimum, we should improve the documentation to warn users against unsafe uses of {{foreach}} that won't work properly when transitioning from local mode to a real cluster. We might also consider changes to local mode so that its behavior more closely matches the cluster modes; this will require some discussion, though, since any change of behavior here would technically be a user-visible backwards-incompatible change (I don't think that we made any explicit guarantees about the current local-mode behavior, but someone might be relying on the current implicit behavior). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5651) Support 'create db.table' in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-5651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312344#comment-14312344 ] Apache Spark commented on SPARK-5651: - User 'OopsOutOfMemory' has created a pull request for this issue: https://github.com/apache/spark/pull/4473 Support 'create db.table' in HiveContext Key: SPARK-5651 URL: https://issues.apache.org/jira/browse/SPARK-5651 Project: Spark Issue Type: Bug Components: SQL Reporter: Yadong Qi The current Spark version only supports ```create table table_in_database_creation.test1 as select * from src limit 1;``` in HiveContext. This patch adds support for ```create table `table_in_database_creation.test2` as select * from src limit 1;``` in HiveContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5570) No docs stating that `new SparkConf().set("spark.driver.memory", ...)` will not work
[ https://issues.apache.org/jira/browse/SPARK-5570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312416#comment-14312416 ] Ilya Ganelin commented on SPARK-5570: - I'll fix this, can you please assign it to me? Thanks. No docs stating that `new SparkConf().set("spark.driver.memory", ...)` will not work --- Key: SPARK-5570 URL: https://issues.apache.org/jira/browse/SPARK-5570 Project: Spark Issue Type: Bug Components: Documentation, Spark Core Affects Versions: 1.2.0 Reporter: Tathagata Das Assignee: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
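The pitfall the ticket asks to document, as a minimal example; in client mode the driver JVM heap is already fixed by the time user code runs, so the setting arrives too late:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Has no effect on the driver's heap in client mode: the driver JVM is this
// JVM, and it was already launched before this line executes.
val conf = new SparkConf().set("spark.driver.memory", "4g")
val sc = new SparkContext(conf)

// Working alternatives, applied before the driver JVM starts:
//   spark-submit --driver-memory 4g ...
//   or set spark.driver.memory in conf/spark-defaults.conf
{code}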
[jira] [Updated] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends
[ https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-823: Assignee: Ilya Ganelin spark.default.parallelism's default is inconsistent across scheduler backends - Key: SPARK-823 URL: https://issues.apache.org/jira/browse/SPARK-823 Project: Spark Issue Type: Bug Components: Documentation, PySpark, Scheduler Affects Versions: 0.8.0, 0.7.3, 0.9.1 Reporter: Josh Rosen Assignee: Ilya Ganelin Priority: Minor The [0.7.3 configuration guide|http://spark-project.org/docs/latest/configuration.html] says that {{spark.default.parallelism}}'s default is 8, but the default is actually max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos scheduler, and {{threads}} for the local scheduler: https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150 Should this be clarified in the documentation? Should the Mesos scheduler backend's default be revised? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312328#comment-14312328 ] Twinkle Sachdeva commented on SPARK-4705: - Hi [~vanzin], please take a look at the screenshot. I will make 'NA' a non-anchored element. It shows the UI of a history server where some applications were run on a scheduler that does not support multiple attempts, whereas other applications have multiple attempts. Should we introduce a property which will show the multi-attempt UI by default? Driver retries in yarn-cluster mode always fail if event logging is enabled --- Key: SPARK-4705 URL: https://issues.apache.org/jira/browse/SPARK-4705 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Marcelo Vanzin Attachments: multi-attempts with no attempt based UI.png yarn-cluster mode will retry running the driver in certain failure modes. If event logging is enabled, this will most probably fail, because: {noformat} Exception in thread "Driver" java.io.IOException: Log directory hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003 already exists! at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129) at org.apache.spark.util.FileLogger.start(FileLogger.scala:115) at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74) at org.apache.spark.SparkContext.<init>(SparkContext.scala:353) {noformat} The event log path should be more unique. Or perhaps retries of the same app should clean up the old logs first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5687) in TaskResultGetter need to catch OutOfMemoryError.
[ https://issues.apache.org/jira/browse/SPARK-5687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312365#comment-14312365 ] Apache Spark commented on SPARK-5687: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/4474 in TaskResultGetter need to catch OutOfMemoryError. --- Key: SPARK-5687 URL: https://issues.apache.org/jira/browse/SPARK-5687 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Lianhui Wang In enqueueSuccessfulTask, a separate thread fetches the task result; if the result is very large, it may throw an OutOfMemoryError. If we do not catch the OutOfMemoryError, the DAGScheduler never learns the status of this task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
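A simplified sketch of the proposed guard (not the actual TaskResultGetter code; the markTaskFailed callback is hypothetical). The fetch runs on a pool thread, so an uncaught OutOfMemoryError would otherwise die with that thread and leave the scheduler waiting forever:
{code}
import java.util.concurrent.Executors

val pool = Executors.newFixedThreadPool(4)

// Sketch: run the result fetch on a pool thread, but surface OutOfMemoryError
// to the scheduler instead of letting the thread die silently.
def enqueueSuccessfulTask(fetchResult: () => Unit, markTaskFailed: Throwable => Unit): Unit =
  pool.execute(new Runnable {
    override def run(): Unit =
      try fetchResult() // may allocate a very large buffer for the task result
      catch {
        case e: OutOfMemoryError =>
          markTaskFailed(e) // the DAGScheduler now learns the task's status
      }
  })
{code}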
[jira] [Updated] (SPARK-5688) Splits for Categorical Variables in DecisionTrees
[ https://issues.apache.org/jira/browse/SPARK-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Denovitzer updated SPARK-5688: --- Summary: Splits for Categorical Variables in DecisionTrees (was: In Decision Trees, choosing a random subset of categories for each split) Splits for Categorical Variables in DecisionTrees - Key: SPARK-5688 URL: https://issues.apache.org/jira/browse/SPARK-5688 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Environment: Any Reporter: Eric Denovitzer Labels: categorical, decisiontree Fix For: 1.2.0 The subset of categories chosen to build a split on a categorical variable is not random. The categories for a subset are chosen based on the binary representation of a number from 1 to (2^(number of categories)) - 2 (which excludes the empty and full subsets). In the current implementation, the integers used for the subsets are always 1..numSplits. These should be chosen at random instead of biasing towards the categories with the lower indexes. Another problem is that if numBins/2 is bigger than the number of possible subsets for a given category set, numSplits is still taken to be numBins/2. It should be the min of numBins/2 and (2^(number of categories)) - 2 (otherwise the same subsets might be considered more than once when choosing the splits). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5688) In Decision Trees, choosing a random subset of categories for each split
Eric Denovitzer created SPARK-5688: -- Summary: In Decision Trees, choosing a random subset of categories for each split Key: SPARK-5688 URL: https://issues.apache.org/jira/browse/SPARK-5688 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Environment: Any Reporter: Eric Denovitzer Fix For: 1.2.0 The subset of categories chosen to build a split on a categorical variable is not random. The categories for a subset are chosen based on the binary representation of a number from 1 to (2^(number of categories)) - 2 (which excludes the empty and full subsets). In the current implementation, the integers used for the subsets are always 1..numSplits. These should be chosen at random instead of biasing towards the categories with the lower indexes. Another problem is that if numBins/2 is bigger than the number of possible subsets for a given category set, numSplits is still taken to be numBins/2. It should be the min of numBins/2 and (2^(number of categories)) - 2 (otherwise the same subsets might be considered more than once when choosing the splits). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5688) Splits for Categorical Variables in DecisionTrees
[ https://issues.apache.org/jira/browse/SPARK-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312403#comment-14312403 ] Apache Spark commented on SPARK-5688: - User 'edenovit' has created a pull request for this issue: https://github.com/apache/spark/pull/4475 Splits for Categorical Variables in DecisionTrees - Key: SPARK-5688 URL: https://issues.apache.org/jira/browse/SPARK-5688 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Environment: Any Reporter: Eric Denovitzer Labels: categorical, decisiontree Fix For: 1.2.0 The subset of categories chosen to build a split on a categorical variable is not random. The categories for a subset are chosen based on the binary representation of a number from 1 to (2^(number of categories)) - 2 (which excludes the empty and full subsets). In the current implementation, the integers used for the subsets are always 1..numSplits. These should be chosen at random instead of biasing towards the categories with the lower indexes. Another problem is that if numBins/2 is bigger than the number of possible subsets for a given category set, numSplits is still taken to be numBins/2. It should be the min of numBins/2 and (2^(number of categories)) - 2 (otherwise the same subsets might be considered more than once when choosing the splits). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5079) Detect failed jobs / batches in Spark Streaming unit tests
[ https://issues.apache.org/jira/browse/SPARK-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312415#comment-14312415 ] Ilya Ganelin commented on SPARK-5079: - I can work on this - can you please assign it to me? Thank you. Detect failed jobs / batches in Spark Streaming unit tests -- Key: SPARK-5079 URL: https://issues.apache.org/jira/browse/SPARK-5079 Project: Spark Issue Type: Bug Components: Streaming Reporter: Josh Rosen Currently, it is possible to write Spark Streaming unit tests where Spark jobs fail but the streaming tests succeed because we rely on wall-clock time plus output comparison in order to check whether a test has passed, and hence may miss cases where errors occurred if they didn't affect these results. We should strengthen the tests to check that no job failures occurred while processing batches. See https://github.com/apache/spark/pull/3832#issuecomment-68580794 for additional context. The StreamingTestWaiter in https://github.com/apache/spark/pull/3801 might also fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
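One way to detect this, roughly in the spirit of the StreamingTestWaiter linked above: register a listener that counts failed jobs and assert on the count after the batches run (a sketch using the public listener API; the test wiring in the comment is illustrative):
{code}
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{JobFailed, SparkListener, SparkListenerJobEnd}

class FailedJobCounter extends SparkListener {
  val failedJobs = new AtomicInteger(0)
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = jobEnd.jobResult match {
    case JobFailed(_) => failedJobs.incrementAndGet()
    case _ =>
  }
}

// In a test:
//   val counter = new FailedJobCounter
//   ssc.sparkContext.addSparkListener(counter)
//   ... run the batches ...
//   assert(counter.failedJobs.get() == 0, "a streaming batch's job failed")
{code}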
[jira] [Created] (SPARK-5689) Document what can be run in different YARN modes
Thomas Graves created SPARK-5689: Summary: Document what can be run in different YARN modes Key: SPARK-5689 URL: https://issues.apache.org/jira/browse/SPARK-5689 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves We should document what can be run in the different YARN modes. For instance, the interactive shell only works in yarn-client mode; recently, with https://github.com/apache/spark/pull/3976, users can run Python scripts in cluster mode, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends
[ https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-823. - Resolution: Fixed spark.default.parallelism's default is inconsistent across scheduler backends - Key: SPARK-823 URL: https://issues.apache.org/jira/browse/SPARK-823 Project: Spark Issue Type: Bug Components: Documentation, PySpark, Scheduler Affects Versions: 0.8.0, 0.7.3, 0.9.1 Reporter: Josh Rosen Assignee: Ilya Ganelin Priority: Minor The [0.7.3 configuration guide|http://spark-project.org/docs/latest/configuration.html] says that {{spark.default.parallelism}}'s default is 8, but the default is actually max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos scheduler, and {{threads}} for the local scheduler: https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150 Should this be clarified in the documentation? Should the Mesos scheduler backend's default be revised? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Twinkle Sachdeva updated SPARK-4705: Attachment: multi-attempts with no attempt based UI.png Driver retries in yarn-cluster mode always fail if event logging is enabled --- Key: SPARK-4705 URL: https://issues.apache.org/jira/browse/SPARK-4705 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Marcelo Vanzin Attachments: multi-attempts with no attempt based UI.png yarn-cluster mode will retry running the driver in certain failure modes. If event logging is enabled, this will most probably fail, because: {noformat} Exception in thread "Driver" java.io.IOException: Log directory hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003 already exists! at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129) at org.apache.spark.util.FileLogger.start(FileLogger.scala:115) at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74) at org.apache.spark.SparkContext.<init>(SparkContext.scala:353) {noformat} The event log path should be more unique. Or perhaps retries of the same app should clean up the old logs first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5703) JobProgressListener throws empty.max error
[ https://issues.apache.org/jira/browse/SPARK-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5703: - Description: In JobProgressListener, if you have a JobEnd that does not have a corresponding JobStart, AND you render the AllJobsPage, then you'll run into the empty.max exception. I ran into this when trying to replay parts of an event log after trimming a few events. Not a common use case I agree, but I'd argue that it should never fail on empty.max. was: In JobProgressListener, if you have a JobEnd that does not have a corresponding JobStart, AND you render the AllJobsPage, then you'll run into the empty.max exception. I ran into this when trying to replay parts of an event log after trimming a few events. Not a common use case I agree, but it should not fail with empty.max. JobProgressListener throws empty.max error -- Key: SPARK-5703 URL: https://issues.apache.org/jira/browse/SPARK-5703 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical In JobProgressListener, if you have a JobEnd that does not have a corresponding JobStart, AND you render the AllJobsPage, then you'll run into the empty.max exception. I ran into this when trying to replay parts of an event log after trimming a few events. Not a common use case I agree, but I'd argue that it should never fail on empty.max. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5703) JobProgressListener throws empty.max error
[ https://issues.apache.org/jira/browse/SPARK-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5703: - Summary: JobProgressListener throws empty.max error (was: JobProgressListener throws empty.max error in HS) JobProgressListener throws empty.max error -- Key: SPARK-5703 URL: https://issues.apache.org/jira/browse/SPARK-5703 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical In JobProgressListener, if you have a JobEnd that does not have a corresponding JobStart, AND you render the AllJobsPage, then you'll run into the empty.max exception. I ran into this when trying to replay parts of an event log after trimming a few events. Not a common use case I agree, but it should not fail with empty.max. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5703) JobProgressListener throws empty.max error in HS
Andrew Or created SPARK-5703: Summary: JobProgressListener throws empty.max error in HS Key: SPARK-5703 URL: https://issues.apache.org/jira/browse/SPARK-5703 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical In JobProgressListener, if you have a JobEnd that does not have a corresponding JobStart, AND you render the AllJobsPage, then you'll run into the empty.max exception. I ran into this when trying to replay parts of an event log after trimming a few events. Not a common use case I agree, but it should not fail with empty.max. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313652#comment-14313652 ] Mark Khaitman commented on SPARK-4105: -- We're running 1.2.1-rc2 on our cluster and running into the exact same problem. Several different jobs by different users will typically run perfectly fine, and then another identical run randomly throws the FAILED_TO_UNCOMPRESS(5) error, which causes the job to fail altogether actually. I'll try to re-produce this somehow, though it is a tricky one! FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle - Key: SPARK-4105 URL: https://issues.apache.org/jira/browse/SPARK-4105 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during shuffle read. Here's a sample stacktrace from an executor: {code} 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 33053) java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at
[jira] [Created] (SPARK-5707) Enabling spark.sql.codegen throws ClassNotFound exception
Yi Yao created SPARK-5707: - Summary: Enabling spark.sql.codegen throws ClassNotFound exception Key: SPARK-5707 URL: https://issues.apache.org/jira/browse/SPARK-5707 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: yarn-client mode Reporter: Yi Yao org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 133.0 failed 4 times, most recent failure: Lost task 13.3 in stage 133.0 (TID 3066, cdh52-node2): java.io.IOException: com.esotericsoftware.kryo.KryoException: Unable to find class: __wrapper$1$81257352e1c844aebf09cb84fe9e7459.__wrapper$1$81257352e1c844aebf09cb84fe9e7459$SpecificRow$1 Serialization trace: hashTable (org.apache.spark.sql.execution.joins.UniqueKeyHashedRelation) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:62) at org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:61) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) SQL: INSERT INTO TABLE ${hiveconf:TEMP_TABLE} SELECT s_store_name, pr_review_date, pr_review_content FROM ( --select store_name for stores with flat or declining sales in 3 consecutive months. SELECT s_store_name FROM store s JOIN ( -- linear regression part SELECT temp.cat AS cat, --SUM(temp.x)as sumX, --SUM(temp.y)as sumY, --SUM(temp.xy)as
[jira] [Created] (SPARK-5708) Add Slf4jSink to Spark Metrics Sink
Judy Nash created SPARK-5708: Summary: Add Slf4jSink to Spark Metrics Sink Key: SPARK-5708 URL: https://issues.apache.org/jira/browse/SPARK-5708 Project: Spark Issue Type: Bug Reporter: Judy Nash Add Slf4jSink to the currently supported metric sinks. This is convenient for those who want metrics data for telemetry purposes, but want to reuse the pre-setup log4j pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
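A rough sketch of what such a sink could look like, modeled on the existing ConsoleSink and the Slf4jReporter that ships with the Codahale metrics library Spark already depends on. The class body, property names, and constructor signature here are assumptions, not the eventual implementation:
{code}
package org.apache.spark.metrics.sink

import java.util.Properties
import java.util.concurrent.TimeUnit

import com.codahale.metrics.{MetricRegistry, Slf4jReporter}

// Hypothetical sketch: the "period"/"unit" property names mirror ConsoleSink's.
private[spark] class Slf4jSink(val property: Properties, val registry: MetricRegistry)
  extends Sink {

  val pollPeriod: Int = Option(property.getProperty("period")).map(_.toInt).getOrElse(10)
  val pollUnit: TimeUnit = Option(property.getProperty("unit"))
    .map(u => TimeUnit.valueOf(u.toUpperCase)).getOrElse(TimeUnit.SECONDS)

  // Slf4jReporter periodically writes all registered metrics through SLF4J,
  // so whatever log4j pipeline is already configured picks them up.
  val reporter: Slf4jReporter = Slf4jReporter.forRegistry(registry)
    .convertDurationsTo(TimeUnit.MILLISECONDS)
    .convertRatesTo(TimeUnit.SECONDS)
    .build()

  override def start(): Unit = reporter.start(pollPeriod, pollUnit)
  override def stop(): Unit = reporter.stop()
  override def report(): Unit = reporter.report()
}
{code}
Wiring it up would then be one line in metrics.properties, e.g. *.sink.slf4j.class=org.apache.spark.metrics.sink.Slf4jSink, so existing log4j appenders receive the metrics output.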
[jira] [Commented] (SPARK-2653) Heap size should be the sum of driver.memory and executor.memory in local mode
[ https://issues.apache.org/jira/browse/SPARK-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313664#comment-14313664 ] liu chang commented on SPARK-2653: -- Hi, Davies, what's wrong with this? Heap size should be the sum of driver.memory and executor.memory in local mode -- Key: SPARK-2653 URL: https://issues.apache.org/jira/browse/SPARK-2653 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Davies Liu Priority: Minor Original Estimate: 1h Remaining Estimate: 1h In local mode, the driver and executor run in the same JVM, so the heap size of JVM should be the sum of spark.driver.memory and spark.executor.memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5703) AllJobsPage throws empty.max error
[ https://issues.apache.org/jira/browse/SPARK-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313332#comment-14313332 ] Apache Spark commented on SPARK-5703: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/4490 AllJobsPage throws empty.max error -- Key: SPARK-5703 URL: https://issues.apache.org/jira/browse/SPARK-5703 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical In JobProgressListener, if you have a JobEnd that does not have a corresponding JobStart, AND you render the AllJobsPage, then you'll run into the empty.max exception. I ran into this when trying to replay parts of an event log after trimming a few events. Not a common use case I agree, but I'd argue that it should never fail on empty.max. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
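For context, calling .max on an empty Scala collection throws java.lang.UnsupportedOperationException: empty.max, so any max over stage ids that can legitimately be empty needs a guard. A hedged illustration of the defensive pattern (stageIds is a stand-in name, not the actual AllJobsPage code):
{code}
val stageIds: Seq[Int] = Seq.empty                              // a JobEnd with no recorded JobStart
// stageIds.max                                                 // would throw: empty.max
val lastStageId = stageIds.reduceOption(_ max _).getOrElse(-1)  // falls back to a sentinel instead
{code}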
[jira] [Commented] (SPARK-5597) Model import/export for DecisionTree and ensembles
[ https://issues.apache.org/jira/browse/SPARK-5597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313388#comment-14313388 ] Apache Spark commented on SPARK-5597: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4493 Model import/export for DecisionTree and ensembles -- Key: SPARK-5597 URL: https://issues.apache.org/jira/browse/SPARK-5597 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley See parent JIRA for more info. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5558) pySpark zip function unexpected errors
[ https://issues.apache.org/jira/browse/SPARK-5558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313434#comment-14313434 ] Charles Hayden edited comment on SPARK-5558 at 2/10/15 2:59 AM: This seems to be working as expected in 1.3 branch and in master. was (Author: cchayden): This seems to be working as expected in 1.3 branch and in main. pySpark zip function unexpected errors -- Key: SPARK-5558 URL: https://issues.apache.org/jira/browse/SPARK-5558 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Charles Hayden Labels: pyspark Example: {quote} x = sc.parallelize(range(0,5)) y = x.map(lambda x: x+1000, preservesPartitioning=True) y.take(10) x.zip(y).collect() {quote} Fails in the JVM: py4J: org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition If the range is changed to range(0,1000) it fails in pySpark code: ValueError: Can not deserialize RDD with different number of items in pair: (100, 1) It also fails if y.take(10) is replaced with y.toDebugString() It even fails if we print y._jrdd -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5706) Support inference schema from a single json string
Cheng Hao created SPARK-5706: Summary: Support inference schema from a single json string Key: SPARK-5706 URL: https://issues.apache.org/jira/browse/SPARK-5706 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao We notice some developers complaining that JSON parsing is very slow, particularly the schema inference. Some of them suggest that we provide a simple interface for inferring the schema from a single complete JSON string record, instead of sampling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
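A hedged sketch of the workaround this issue would formalize, using the existing two-argument jsonRDD overload: infer the schema once from one representative record, then apply it to the full dataset so no sampling pass is needed (fullJsonRdd is an assumed RDD[String] of JSON records):
{code}
val oneRecord = sc.parallelize(Seq("""{"name": "spark", "count": 1}"""))
val schema = sqlContext.jsonRDD(oneRecord).schema   // inference touches a single record
val data = sqlContext.jsonRDD(fullJsonRdd, schema)  // the full load skips inference entirely
{code}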
[jira] [Commented] (SPARK-5682) Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313380#comment-14313380 ] Apache Spark commented on SPARK-5682: - User 'kellyzly' has created a pull request for this issue: https://github.com/apache/spark/pull/4491 Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data exchange safer. This feature is necessary in Spark. We reuse the Hadoop encrypted shuffle feature in Spark, and because UGI credential info is necessary for encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5682) Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated SPARK-5682: Attachment: (was: encrypted_shuffle.patch.4) Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data exchange safer. This feature is necessary in Spark. We reuse the Hadoop encrypted shuffle feature in Spark, and because UGI credential info is necessary for encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5705) Explore GPU-accelerated Linear Algebra Libraries
[ https://issues.apache.org/jira/browse/SPARK-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313398#comment-14313398 ] Evan Sparks commented on SPARK-5705: This JIRA is a continuation of this thread: http://apache-spark-developers-list.1001551.n3.nabble.com/Using-CUDA-within-Spark-boosting-linear-algebra-td10481.html To summarise - high-speed linear algebra operations including, but not limited to, matrix multiplies and solves have the potential to make certain machine learning operations faster on spark. However, we've got to be careful to balance the overheads of copying data/calling out to the GPU with other factors in the design of the system. Additionally - getting these libraries compiled, linked, built, and configured on a target system is unfortunately not trivial. We should make sure we have a standard process for doing this (perhaps starting with this codebase: http://github.com/shivaram/matrix-bench). Maybe we should start with some applications where we think GPU acceleration could help? Neural nets is one, LDA is another - others? Explore GPU-accelerated Linear Algebra Libraries Key: SPARK-5705 URL: https://issues.apache.org/jira/browse/SPARK-5705 Project: Spark Issue Type: Bug Components: MLlib Reporter: Evan Sparks Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5558) pySpark zip function unexpected errors
[ https://issues.apache.org/jira/browse/SPARK-5558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313434#comment-14313434 ] Charles Hayden commented on SPARK-5558: --- This seems to be working as expected in 1.3 branch and in main. pySpark zip function unexpected errors -- Key: SPARK-5558 URL: https://issues.apache.org/jira/browse/SPARK-5558 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Charles Hayden Labels: pyspark Example: {quote} x = sc.parallelize(range(0,5)) y = x.map(lambda x: x+1000, preservesPartitioning=True) y.take(10) x.zip(y).collect() {quote} Fails in the JVM: py4J: org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition If the range is changed to range(0,1000) it fails in pySpark code: ValueError: Can not deserialize RDD with different number of items in pair: (100, 1) It also fails if y.take(10) is replaced with y.toDebugString() It even fails if we print y._jrdd -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5597) Model import/export for DecisionTree and ensembles
[ https://issues.apache.org/jira/browse/SPARK-5597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5597. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4493 [https://github.com/apache/spark/pull/4493] Model import/export for DecisionTree and ensembles -- Key: SPARK-5597 URL: https://issues.apache.org/jira/browse/SPARK-5597 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Fix For: 1.3.0 See parent JIRA for more info. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5704) createDataFrame replace applySchema/inferSchema
[ https://issues.apache.org/jira/browse/SPARK-5704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-5704: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-5166 createDataFrame replace applySchema/inferSchema --- Key: SPARK-5704 URL: https://issues.apache.org/jira/browse/SPARK-5704 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5705) Explore GPU-accelerated Linear Algebra Libraries
Evan Sparks created SPARK-5705: -- Summary: Explore GPU-accelerated Linear Algebra Libraries Key: SPARK-5705 URL: https://issues.apache.org/jira/browse/SPARK-5705 Project: Spark Issue Type: Bug Components: MLlib Reporter: Evan Sparks Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2653) Heap size should be the sum of driver.memory and executor.memory in local mode
[ https://issues.apache.org/jira/browse/SPARK-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313680#comment-14313680 ] Davies Liu commented on SPARK-2653: --- Right now, in local mode, only one of spark.driver.memory or spark.executor.memory is used as the heap size of JVM (depends on the order of command arguments). I think it should be the sum of them. Heap size should be the sum of driver.memory and executor.memory in local mode -- Key: SPARK-2653 URL: https://issues.apache.org/jira/browse/SPARK-2653 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Davies Liu Priority: Minor Original Estimate: 1h Remaining Estimate: 1h In local mode, the driver and executor run in the same JVM, so the heap size of JVM should be the sum of spark.driver.memory and spark.executor.memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
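A minimal sketch of the proposed rule; Utils.memoryStringToMb is the existing (private[spark]) helper for parsing strings like 512m or 4g, so this is illustrative rather than the real launcher code:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.util.Utils

def localModeHeapMB(conf: SparkConf): Int = {
  val driver = Utils.memoryStringToMb(conf.get("spark.driver.memory", "512m"))
  val executor = Utils.memoryStringToMb(conf.get("spark.executor.memory", "512m"))
  driver + executor  // sum both budgets instead of letting argument order decide
}
{code}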
[jira] [Created] (SPARK-5709) Add EXPLAIN support for DataFrame API for debugging purpose
Cheng Hao created SPARK-5709: Summary: Add EXPLAIN support for DataFrame API for debugging purpose Key: SPARK-5709 URL: https://issues.apache.org/jira/browse/SPARK-5709 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5702) Allow short names for built-in data sources
Reynold Xin created SPARK-5702: -- Summary: Allow short names for built-in data sources Key: SPARK-5702 URL: https://issues.apache.org/jira/browse/SPARK-5702 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin e.g. json, parquet, jdbc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5703) AllJobsPage throws empty.max error
[ https://issues.apache.org/jira/browse/SPARK-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5703: - Summary: AllJobsPage throws empty.max error (was: JobProgressListener throws empty.max error) AllJobsPage throws empty.max error -- Key: SPARK-5703 URL: https://issues.apache.org/jira/browse/SPARK-5703 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical In JobProgressListener, if you have a JobEnd that does not have a corresponding JobStart, AND you render the AllJobsPage, then you'll run into the empty.max exception. I ran into this when trying to replay parts of an event log after trimming a few events. Not a common use case I agree, but I'd argue that it should never fail on empty.max. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2653) Heap size should be the sum of driver.memory and executor.memory in local mode
[ https://issues.apache.org/jira/browse/SPARK-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313663#comment-14313663 ] liu chang commented on SPARK-2653: -- Hi, Davies, what's wrong with this? Heap size should be the sum of driver.memory and executor.memory in local mode -- Key: SPARK-2653 URL: https://issues.apache.org/jira/browse/SPARK-2653 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Davies Liu Priority: Minor Original Estimate: 1h Remaining Estimate: 1h In local mode, the driver and executor run in the same JVM, so the heap size of JVM should be the sum of spark.driver.memory and spark.executor.memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5710) Combines two adjacent `Cast` expressions into one
guowei created SPARK-5710: - Summary: Combines two adjacent `Cast` expressions into one Key: SPARK-5710 URL: https://issues.apache.org/jira/browse/SPARK-5710 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: guowei A plan after `analyzer` with `typeCoercionRules` may produce many `cast` expressions. We can combine the adjacent ones. For example: create table test(a decimal(3,1)); explain select * from test where a*2-1 > 1; == Physical Plan == Filter (CAST(CAST((CAST(CAST((CAST(a#5, DecimalType()) * 2), DecimalType(21,1)), DecimalType()) - 1), DecimalType(22,1)), DecimalType()) > 1) HiveTableScan [a#5], (MetastoreRelation default, test, None), None -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
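A hedged sketch of what such an optimizer rule could look like in Catalyst; CombineCasts and isLossless are hypothetical names, and a real rule must verify that dropping the inner cast cannot change the value (a narrowing intermediate cast, for instance, is not removable):
{code}
import org.apache.spark.sql.catalyst.expressions.Cast
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.DataType

object CombineCasts extends Rule[LogicalPlan] {
  // Conservative placeholder: only identical types are trivially lossless;
  // a real rule would widen this (e.g. Int -> Long, precision-growing decimals).
  private def isLossless(from: DataType, to: DataType): Boolean = from == to

  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Cast(Cast(child, intermediate), target) if isLossless(child.dataType, intermediate) =>
      Cast(child, target)  // collapse the two adjacent casts into one
  }
}
{code}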
[jira] [Comment Edited] (SPARK-5570) No docs stating that `new SparkConf().set(spark.driver.memory, ...) will not work
[ https://issues.apache.org/jira/browse/SPARK-5570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312416#comment-14312416 ] Ilya Ganelin edited comment on SPARK-5570 at 2/9/15 4:27 PM: - I would be happy to fix this. Thank you. was (Author: ilganeli): I'll fix this, can you please assign it to me? Thanks. No docs stating that `new SparkConf().set(spark.driver.memory, ...) will not work --- Key: SPARK-5570 URL: https://issues.apache.org/jira/browse/SPARK-5570 Project: Spark Issue Type: Bug Components: Documentation, Spark Core Affects Versions: 1.2.0 Reporter: Tathagata Das Assignee: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends
[ https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312382#comment-14312382 ] Ilya Ganelin commented on SPARK-823: Hi [~joshrosen] I believe the documentation is up to date and I reviewed all usages of spark.default.parallelism and found no inconsistencies with the documentation. The only thing that is un-documented with regards to the usage of spark.default.parallelism is how it's used within the Partitioner class in both Spark and Python. If defined, the default number of partitions created is equal to spark.default.parallelism - otherwise, it's the local number of partitions. I think this issue can be closed - I don't think that particular case needs to be publicly documented (it's clearly evident in the code what is going on). spark.default.parallelism's default is inconsistent across scheduler backends - Key: SPARK-823 URL: https://issues.apache.org/jira/browse/SPARK-823 Project: Spark Issue Type: Bug Components: Documentation, PySpark, Scheduler Affects Versions: 0.8.0, 0.7.3, 0.9.1 Reporter: Josh Rosen Priority: Minor The [0.7.3 configuration guide|http://spark-project.org/docs/latest/configuration.html] says that {{spark.default.parallelism}}'s default is 8, but the default is actually max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos scheduler, and {{threads}} for the local scheduler: https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150 Should this be clarified in the documentation? Should the Mesos scheduler backend's default be revised? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
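For reference, the Partitioner behavior the comment describes boils down to roughly this (a simplification of Partitioner.defaultPartitioner, not the verbatim code):
{code}
import org.apache.spark.rdd.RDD

def defaultNumPartitions(rdd: RDD[_]): Int =
  if (rdd.context.getConf.contains("spark.default.parallelism")) {
    rdd.context.defaultParallelism  // the explicit setting wins
  } else {
    rdd.partitions.length           // otherwise, the upstream partition count
  }
{code}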
[jira] [Commented] (SPARK-4600) org.apache.spark.graphx.VertexRDD.diff does not work
[ https://issues.apache.org/jira/browse/SPARK-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312589#comment-14312589 ] Brennon York commented on SPARK-4600: - I can take this, thanks! org.apache.spark.graphx.VertexRDD.diff does not work Key: SPARK-4600 URL: https://issues.apache.org/jira/browse/SPARK-4600 Project: Spark Issue Type: Bug Components: GraphX Environment: scala 2.10.4 spark 1.1.0 Reporter: Teppei Tosa Labels: graphx VertexRDD.diff doesn't work. For example : val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 2L).map(id => (id, id.toInt))) setA.collect.foreach(println(_)) // (0,0) // (1,1) val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(1L until 3L).map(id => (id, id.toInt))) setB.collect.foreach(println(_)) // (1,1) // (2,2) val diff = setA.diff(setB) diff.collect.foreach(println(_)) // printed none -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1142) Allow adding jars on app submission, outside of code
[ https://issues.apache.org/jira/browse/SPARK-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1142. Resolution: Not a Problem Allow adding jars on app submission, outside of code Key: SPARK-1142 URL: https://issues.apache.org/jira/browse/SPARK-1142 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 0.9.0 Reporter: Sandy Pérez González Assignee: Sandy Ryza yarn-standalone mode supports an option that allows adding jars that will be distributed on the cluster with job submission. Providing similar functionality for other app submission modes will allow the spark-app script proposed in SPARK-1126 to support an add-jars option that works for every submit mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5691) Preventing duplicate registering of an application has incorrect logic
Matt Cheah created SPARK-5691: - Summary: Preventing duplicate registering of an application has incorrect logic Key: SPARK-5691 URL: https://issues.apache.org/jira/browse/SPARK-5691 Project: Spark Issue Type: Bug Affects Versions: 1.2.0, 1.1.1 Reporter: Matt Cheah Fix For: 1.3.0 If an application registers twice with the Master, the Master accepts both copies and they both show up in the UI and consume resources. This is incorrect behavior. This happens inadvertently in regular usage when the Master is under high load, but it boils down to: when an application times out registering with the master and sends a second registration message, but the Master is still alive, it processes the first registration message for the application but also erroneously processes the second registration message instead of discarding it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
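A hedged sketch of one plausible guard, keyed on the driver's address so that a retried registration is dropped rather than creating a second application entry; this is illustrative, not necessarily the shipped fix:
{code}
// Illustrative dedupe (names simplified from Master's actor code):
import scala.collection.mutable

val addressToApp = mutable.HashMap[String, String]()  // driver address -> app id

def registerOnce(driverAddress: String, newAppId: => String): Option[String] =
  if (addressToApp.contains(driverAddress)) {
    None  // duplicate registration from a retrying driver: keep only the first
  } else {
    val id = newAppId
    addressToApp(driverAddress) = id
    Some(id)
  }
{code}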
[jira] [Created] (SPARK-5692) Model import/export for Word2Vec
Xiangrui Meng created SPARK-5692: Summary: Model import/export for Word2Vec Key: SPARK-5692 URL: https://issues.apache.org/jira/browse/SPARK-5692 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Support save and load for Word2VecModel. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5692) Model import/export for Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5692: - Description: Support save and load for Word2VecModel. We may want to discuss whether we want to be compatible with the original Word2Vec model storage format. (was: Support save and load for Word2VecModel.) Model import/export for Word2Vec Key: SPARK-5692 URL: https://issues.apache.org/jira/browse/SPARK-5692 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Support save and load for Word2VecModel. We may want to discuss whether we want to be compatible with the original Word2Vec model storage format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
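A hedged sketch of the surface this sub-task implies, mirroring the save/load pattern used by the sibling tree-model sub-task; the exact method signatures are assumptions:
{code}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val model: Word2VecModel = new Word2Vec().fit(corpus)  // corpus: an assumed RDD[Iterable[String]]
model.save(sc, "hdfs:///models/word2vec")              // proposed export
val restored = Word2VecModel.load(sc, "hdfs:///models/word2vec")  // proposed import
{code}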
[jira] [Created] (SPARK-5693) Install Pandas on Jenkins machines and enable to_pandas doctest for DataFrames
Reynold Xin created SPARK-5693: -- Summary: Install Pandas on Jenkins machines and enable to_pandas doctest for DataFrames Key: SPARK-5693 URL: https://issues.apache.org/jira/browse/SPARK-5693 Project: Spark Issue Type: Improvement Components: SQL, Tests Reporter: Reynold Xin Assignee: Patrick Wendell DataFrame.to_pandas doctests are disabled as Jenkins machines don't have Pandas installed. [~pwendell] I assigned this to you but feel free to delegate. Thanks. cc [~davies] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5694) Python API for evaluation metrics
Xiangrui Meng created SPARK-5694: Summary: Python API for evaluation metrics Key: SPARK-5694 URL: https://issues.apache.org/jira/browse/SPARK-5694 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Add Python API for evaluation metrics defined under `mllib.evaluation`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
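For reference, the Scala side these bindings would wrap already exists under mllib.evaluation; for example:
{code}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// (score, label) pairs; the Python API would expose this same computation
val scoreAndLabels = sc.parallelize(Seq((0.9, 1.0), (0.2, 0.0), (0.8, 1.0)))
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(metrics.areaUnderROC())
{code}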
[jira] [Commented] (SPARK-4900) MLlib SingularValueDecomposition ARPACK IllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312769#comment-14312769 ] Mike Beyer commented on SPARK-4900: --- put a snapshot test data 1000x1000 matrix to https://dl.dropboxusercontent.com/u/8489998/test_matrix_1.zip calling: String filename = "/custompath/27637/test_matrix_1"; RDD<Vector> vectorRDD = MLUtils.loadVectors(javaSparkContext.sc(), filename); vectorRDD.cache(); System.out.println("trtRowRDD.count():\t" + vectorRDD.count()); RowMatrix rowMatrix = new RowMatrix(vectorRDD); System.out.println("rowMatrix.numRows():\t" + rowMatrix.numRows()); System.out.println("rowMatrix.numCols():\t" + rowMatrix.numCols()); { int k = 10; boolean computeU = true; double rCond = 1.0E-9d; SingularValueDecomposition<RowMatrix, Matrix> svd = rowMatrix.computeSVD(k, computeU, rCond); RowMatrix u = svd.U(); RDD<Vector> uRowsRDD = u.rows(); System.out.println("uRowsRDD.count():\t" + uRowsRDD.count()); Vector s = svd.s(); System.out.println("s.size():\t" + s.size()); Matrix v = svd.V(); System.out.println("v.numRows():\t" + v.numRows()); System.out.println("v.numCols():\t" + v.numCols()); } results in: maxFeatureSpaceTermNumber: 1000 trtRowRDD.count(): 1000 rowMatrix.numRows():1000 rowMatrix.numCols():1000 15/02/09 19:56:59 WARN PrimaryRunnerSpark: java.lang.IllegalStateException: ARPACK returns non-zero info = 3 Please refer ARPACK user guide for error message. at org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:120) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:258) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:190) at com.example.processing.spark.SVDProcessing2.createSVD_2(SVDProcessing2.java:184) at com.example.processing.spark.RunnerSpark.main(PrimaryRunnerSpark.java:27) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at sbt.Run.invokeMain(Run.scala:67) at sbt.Run.run0(Run.scala:61) at sbt.Run.sbt$Run$$execute$1(Run.scala:51) at sbt.Run$$anonfun$run$1.apply$mcV$sp(Run.scala:55) at sbt.Run$$anonfun$run$1.apply(Run.scala:55) at sbt.Run$$anonfun$run$1.apply(Run.scala:55) at sbt.Logger$$anon$4.apply(Logger.scala:85) at sbt.TrapExit$App.run(TrapExit.scala:248) at java.lang.Thread.run(Thread.java:745) 15/02/09 19:56:59 INFO TimeCounter: TIMER [com.example.processing.spark.PrimaryRunnerSpark] : 13.0 Seconds TIMER [com.example.processing.spark.PrimaryRunnerSpark] : 13.0 Seconds 15/02/09 19:56:59 ERROR ContextCleaner: Error cleaning broadcast 20 java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at
org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137) at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227) at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45) at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66) at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:185) at
[jira] [Updated] (SPARK-5195) when hive table is query with alias the cache data lose effectiveness.
[ https://issues.apache.org/jira/browse/SPARK-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5195: --- Assignee: yixiaohua when hive table is query with alias the cache data lose effectiveness. Key: SPARK-5195 URL: https://issues.apache.org/jira/browse/SPARK-5195 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: yixiaohua Assignee: yixiaohua Fix For: 1.3.0 Override the MetastoreRelation's sameResult method to compare only the database name and table name. Previously, after: cache table t1; select count(*) from t1; the query reads data from memory, but the SQL below does not, instead reading from HDFS: select count(*) from t1 t; Cached data is keyed by logical plan and compared with sameResult, so when a table is queried with an alias its logical plan is not the same as the plan without the alias. Hence the sameResult method is modified to compare only the database name and table name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5343) ShortestPaths traverses backwards
[ https://issues.apache.org/jira/browse/SPARK-5343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312651#comment-14312651 ] Apache Spark commented on SPARK-5343: - User 'brennonyork' has created a pull request for this issue: https://github.com/apache/spark/pull/4478 ShortestPaths traverses backwards - Key: SPARK-5343 URL: https://issues.apache.org/jira/browse/SPARK-5343 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Reporter: Michael Malak GraphX ShortestPaths seems to be following edges backwards instead of forwards: import org.apache.spark.graphx._ val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,"")))) lib.ShortestPaths.run(g,Array(3)).vertices.collect res1: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3 -> 0)), (2,Map())) lib.ShortestPaths.run(g,Array(1)).vertices.collect res2: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), (3,Map(1 -> 2)), (2,Map(1 -> 1))) The following changes may be what will make it run forward: Change one occurrence of src to dst in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L64 Change three occurrences of dst to src in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L65 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5679) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method
[ https://issues.apache.org/jira/browse/SPARK-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5679: -- Labels: flaky-test (was: ) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method -- Key: SPARK-5679 URL: https://issues.apache.org/jira/browse/SPARK-5679 Project: Spark Issue Type: Bug Components: Spark Core, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Kostas Sakellis Priority: Blocker Labels: flaky-test Please audit these and see if there are any assumptions with respect to File IO that might not hold in all cases. I'm happy to help if you can't find anything. These both failed in the same run: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-SBT/38/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink {code} org.apache.spark.metrics.InputOutputMetricsSuite.input metrics with mixed read method Failing for the past 13 builds (Since Failed#26 ) Took 48 sec. Error Message 2030 did not equal 6496 Stacktrace sbt.ForkMain$ForkError: 2030 did not equal 6496 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply$mcV$sp(InputOutputMetricsSuite.scala:135) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfter$$super$runTest(InputOutputMetricsSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.metrics.InputOutputMetricsSuite.runTest(InputOutputMetricsSuite.scala:46) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at 
org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfterAll$$super$run(InputOutputMetricsSuite.scala:46) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) at
[jira] [Resolved] (SPARK-5664) Restore stty settings when exiting for launching spark-shell from SBT
[ https://issues.apache.org/jira/browse/SPARK-5664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5664. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4451 [https://github.com/apache/spark/pull/4451] Restore stty settings when exiting for launching spark-shell from SBT - Key: SPARK-5664 URL: https://issues.apache.org/jira/browse/SPARK-5664 Project: Spark Issue Type: Bug Components: Build Reporter: Liang-Chi Hsieh Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3242) Spark 1.0.2 ec2 scripts creates clusters with Spark 1.0.1 installed by default
[ https://issues.apache.org/jira/browse/SPARK-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3242. -- Resolution: Fixed Resolved by https://github.com/apache/spark/commit/444ccdd80ec5df249978d8498b4fc501cc3429d7 Spark 1.0.2 ec2 scripts creates clusters with Spark 1.0.1 installed by default -- Key: SPARK-3242 URL: https://issues.apache.org/jira/browse/SPARK-3242 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.2 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5311) EventLoggingListener throws exception if log directory does not exist
[ https://issues.apache.org/jira/browse/SPARK-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5311. Resolution: Won't Fix Assignee: Josh Rosen EventLoggingListener throws exception if log directory does not exist - Key: SPARK-5311 URL: https://issues.apache.org/jira/browse/SPARK-5311 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker If the log directory does not exist, EventLoggingListener throws an IllegalArgumentException. Here's a simple reproduction (using the master branch (1.3.0)): {code} ./bin/spark-shell --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/tmp/nonexistent-dir {code} where /tmp/nonexistent-dir is a directory that doesn't exist and /tmp exists. This results in the following exception: {code} 15/01/18 17:10:44 INFO HttpServer: Starting HTTP Server 15/01/18 17:10:44 INFO Utils: Successfully started service 'HTTP file server' on port 62729. 15/01/18 17:10:44 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. 15/01/18 17:10:44 INFO Utils: Successfully started service 'SparkUI' on port 4041. 15/01/18 17:10:44 INFO SparkUI: Started SparkUI at http://joshs-mbp.att.net:4041 15/01/18 17:10:45 INFO Executor: Using REPL class URI: http://192.168.1.248:62726 15/01/18 17:10:45 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkdri...@joshs-mbp.att.net:62728/user/HeartbeatReceiver 15/01/18 17:10:45 INFO NettyBlockTransferService: Server created on 62730 15/01/18 17:10:45 INFO BlockManagerMaster: Trying to register BlockManager 15/01/18 17:10:45 INFO BlockManagerMasterActor: Registering block manager localhost:62730 with 265.4 MB RAM, BlockManagerId(driver, localhost, 62730) 15/01/18 17:10:45 INFO BlockManagerMaster: Registered BlockManager java.lang.IllegalArgumentException: Log directory /tmp/nonexistent-dir does not exist. 
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:90) at org.apache.spark.SparkContext.<init>(SparkContext.scala:363) at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:986) at $iwC$$iwC.<init>(<console>:9) at $iwC.<init>(<console>:18) at <init>(<console>:20) at .<init>(<console>:24) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:123) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122) at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:270) at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122) at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:60) at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:147) at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:60) at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106) at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:60) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:962) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916) at
[jira] [Commented] (SPARK-5589) Split pyspark/sql.py into multiple files
[ https://issues.apache.org/jira/browse/SPARK-5589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312781#comment-14312781 ] Apache Spark commented on SPARK-5589: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4479 Split pyspark/sql.py into multiple files Key: SPARK-5589 URL: https://issues.apache.org/jira/browse/SPARK-5589 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu sql.py has more than 2k LOC; it should be split into multiple modules. Also, put all data types into pyspark.sql.types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5695) Check GBT caching logic
Xiangrui Meng created SPARK-5695: Summary: Check GBT caching logic Key: SPARK-5695 URL: https://issues.apache.org/jira/browse/SPARK-5695 Project: Spark Issue Type: Task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng A user reported that `t(n) = t(n-1) + const`, which may be caused by cached RDDs being kicked out during training. We want to check whether there is room to improve the caching logic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4900) MLlib SingularValueDecomposition ARPACK IllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Beyer updated SPARK-4900: -- Affects Version/s: 1.2.1 MLlib SingularValueDecomposition ARPACK IllegalStateException -- Key: SPARK-4900 URL: https://issues.apache.org/jira/browse/SPARK-4900 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0, 1.2.1 Environment: Ubuntu 1410, Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode) spark local mode Reporter: Mike Beyer java.lang.reflect.InvocationTargetException ... Caused by: java.lang.IllegalStateException: ARPACK returns non-zero info = 3 Please refer ARPACK user guide for error message. at org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:120) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:235) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:171) ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4900) MLlib SingularValueDecomposition ARPACK IllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312795#comment-14312795 ] Sean Owen commented on SPARK-4900: -- So I think there is at least a small problem in the error reporting: {code} info.`val` match { case 1 => throw new IllegalStateException("ARPACK returns non-zero info = " + info.`val` + " Maximum number of iterations taken. (Refer ARPACK user guide for details)") case 2 => throw new IllegalStateException("ARPACK returns non-zero info = " + info.`val` + " No shifts could be applied. Try to increase NCV. " + "(Refer ARPACK user guide for details)") case _ => throw new IllegalStateException("ARPACK returns non-zero info = " + info.`val` + " Please refer ARPACK user guide for error message.") } {code} Really, what's called case 2 here corresponds to return value 3, which is what you get. {code} = 0: Normal exit. = 1: Maximum number of iterations taken. All possible eigenvalues of OP has been found. IPARAM(5) returns the number of wanted converged Ritz values. = 2: No longer an informational error. Deprecated starting with release 2 of ARPACK. = 3: No shifts could be applied during a cycle of the Implicitly restarted Arnoldi iteration. One possibility is to increase the size of NCV relative to NEV. See remark 4 below. {code} I can fix the error message. Remark 4 that it refers to is: {code} 4. At present there is no a-priori analysis to guide the selection of NCV relative to NEV. The only formal requirement is that NCV > NEV. However, it is recommended that NCV .ge. 2*NEV. If many problems of the same type are to be solved, one should experiment with increasing NCV while keeping NEV fixed for a given test problem. This will usually decrease the required number of OP*x operations but it also increases the work and storage required to maintain the orthogonal basis vectors. The optimal cross-over with respect to CPU time is problem dependent and must be determined empirically. {code} So I think that translates to "k is too big". Is the matrix low-rank? In any event this is ultimately breeze code and I'm not sure if there's much that will be done in Spark itself. MLlib SingularValueDecomposition ARPACK IllegalStateException -- Key: SPARK-4900 URL: https://issues.apache.org/jira/browse/SPARK-4900 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0, 1.2.1 Environment: Ubuntu 1410, Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode) spark local mode Reporter: Mike Beyer java.lang.reflect.InvocationTargetException ... Caused by: java.lang.IllegalStateException: ARPACK returns non-zero info = 3 Please refer ARPACK user guide for error message. at org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:120) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:235) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:171) ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
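In practical terms, the remark suggests retrying with a smaller k (ARPACK's NEV) so that the internal NCV, roughly 2*k, has room to converge; the value below is illustrative only:
{code}
// If computeSVD(k = 10, ...) hit info = 3, a smaller k is the first thing to try:
val svd = rowMatrix.computeSVD(5, computeU = true, rCond = 1e-9)
{code}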
[jira] [Created] (SPARK-5683) Improve the json serialization for DataFrame API
Cheng Hao created SPARK-5683: Summary: Improve the json serialization for DataFrame API Key: SPARK-5683 URL: https://issues.apache.org/jira/browse/SPARK-5683 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5683) Improve the json serialization for DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-5683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311923#comment-14311923 ] Apache Spark commented on SPARK-5683: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/4468 Improve the json serialization for DataFrame API Key: SPARK-5683 URL: https://issues.apache.org/jira/browse/SPARK-5683 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5684) Key not found exception is thrown in case location of added partition to a parquet table is different than a path containing the partition values
[ https://issues.apache.org/jira/browse/SPARK-5684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yash Datta updated SPARK-5684: -- Priority: Major (was: Critical) Key not found exception is thrown in case location of added partition to a parquet table is different than a path containing the partition values - Key: SPARK-5684 URL: https://issues.apache.org/jira/browse/SPARK-5684 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0 Reporter: Yash Datta Fix For: 1.3.0 Create a partitioned parquet table : create table test_table (dummy string) partitioned by (timestamp bigint) stored as parquet; Add a partition to the table and specify a different location: alter table test_table add partition (timestamp=9) location '/data/pth/different' Run a simple select * query and we get an exception: 15/02/09 08:27:25 ERROR thriftserver.SparkSQLDriver: Failed in [select * from db4_mi2mi_binsrc1_default limit 5] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 21, localhost): java.util.NoSuchElementException: key not found: timestamp at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:58) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:141) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:128) at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:247) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) This happens because in the parquet path it is assumed that (key=value) patterns are present in the partition location, which is not always the case! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5685) Show warning when users open text files compressed with non-splittable algorithms like gzip
Nicholas Chammas created SPARK-5685: --- Summary: Show warning when users open text files compressed with non-splittable algorithms like gzip Key: SPARK-5685 URL: https://issues.apache.org/jira/browse/SPARK-5685 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Nicholas Chammas Priority: Minor This is a usability or user-friendliness issue. It's extremely common for people to load a text file compressed with gzip, process it, and then wonder why only 1 core in their cluster is doing any work. Some examples: * http://stackoverflow.com/q/28127119/877069 * http://stackoverflow.com/q/27531816/877069 I'm not sure how this problem can be generalized, but at the very least it would be helpful if Spark displayed some kind of warning in the common case when someone opens a gzipped file with {{sc.textFile}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
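For reference, the standard user-side workaround for the behavior described above is a repartition right after the read, which is roughly what such a warning would point people at:
{code}
val lines = sc.textFile("logs/big.json.gz")            // gzip is not splittable: arrives as 1 partition
val spread = lines.repartition(sc.defaultParallelism)  // fan the work out to the cluster
{code}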
[jira] [Commented] (SPARK-5710) Combines two adjacent `Cast` expressions into one
[ https://issues.apache.org/jira/browse/SPARK-5710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313729#comment-14313729 ] Apache Spark commented on SPARK-5710: - User 'guowei2' has created a pull request for this issue: https://github.com/apache/spark/pull/4497 Combines two adjacent `Cast` expressions into one - Key: SPARK-5710 URL: https://issues.apache.org/jira/browse/SPARK-5710 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: guowei A plan after `analyzer` with `typeCoercionRules` may produce many `cast` expressions. We can combine the adjacent ones. For example: create table test(a decimal(3,1)); explain select * from test where a*2-1 > 1; == Physical Plan == Filter (CAST(CAST((CAST(CAST((CAST(a#5, DecimalType()) * 2), DecimalType(21,1)), DecimalType()) - 1), DecimalType(22,1)), DecimalType()) > 1) HiveTableScan [a#5], (MetastoreRelation default, test, None), None -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5711) Sort Shuffle performance issues about using AppendOnlyMap for large data sets
Sun Fulin created SPARK-5711: Summary: Sort Shuffle performance issues about using AppendOnlyMap for large data sets Key: SPARK-5711 URL: https://issues.apache.org/jira/browse/SPARK-5711 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: hbase-0.98.6-cdh5.2.0 phoenix-4.2.2 Reporter: Sun Fulin Recently we hit performance issues when using Spark 1.2.0 to read data from HBase and do some summary work. Our scenario: read a large data set from HBase (possibly a 100 GB+ file), form an HBase RDD, transform it to a SchemaRDD, group by and aggregate the data into a smaller set of summary data sets, then load the results into HBase (Phoenix). Our major issue: aggregating the large data set into the summary data sets takes far too long (1 hour+), which is much worse performance than should be expected. We attached the dump file and captured stack traces from jstack like the following. From the stack traces and dump file we can see that processing the large data set causes the AppendOnlyMap to grow frequently, leading to a huge number of map entries. We looked at the source code of org.apache.spark.util.collection.AppendOnlyMap and found that the map is initialized with a capacity of 64. That is too small for our use case.
{code}
Thread 22432: (state = IN_JAVA)
 - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, line=224 (Compiled frame; information may be imprecise)
 - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() @bci=1, line=38 (Interpreted frame)
 - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, line=198 (Compiled frame)
 - org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=201, line=145 (Compiled frame)
 - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=3, line=32 (Compiled frame)
 - org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator) @bci=141, line=205 (Compiled frame)
 - org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator) @bci=74, line=58 (Interpreted frame)
 - org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=169, line=68 (Interpreted frame)
 - org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=2, line=41 (Interpreted frame)
 - org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted frame)
 - org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)

Thread 22431: (state = IN_JAVA)
 - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, line=224 (Compiled frame; information may be imprecise)
 - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() @bci=1, line=38 (Interpreted frame)
 - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, line=198 (Compiled frame)
 - org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=201, line=145 (Compiled frame)
 - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=3, line=32 (Compiled frame)
 - org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator) @bci=141, line=205 (Compiled frame)
 - org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator) @bci=74, line=58 (Interpreted frame)
 - org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=169, line=68 (Interpreted frame)
 - org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=2, line=41 (Interpreted frame)
 - org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted frame)
 - org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
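For scale: each growTable call roughly doubles the table and rehashes every live entry, so a map that starts at 64 slots and ends up holding a hundred million aggregation keys pays for about twenty doublings. A toy Scala sketch of that cost (the 0.7 load factor and doubling mirror AppendOnlyMap's growth policy as we read it; the numbers are illustrative):
{code}
// Count the grow-and-rehash rounds a map starting at `initial` capacity needs
// to hold `n` entries, assuming it doubles whenever it passes a 0.7 load factor.
def growths(n: Long, initial: Long = 64): Int = {
  var capacity = initial
  var count = 0
  while (capacity * 0.7 < n) { capacity *= 2; count += 1 }
  count
}

println(growths(100000000L)) // 22 -- and each round rehashes every live entry
{code}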
[jira] [Commented] (SPARK-5700) Bump jets3t version from 0.9.2 to 0.9.3 in hadoop-2.3 and hadoop-2.4 profiles
[ https://issues.apache.org/jira/browse/SPARK-5700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313746#comment-14313746 ] Josh Rosen commented on SPARK-5700: --- Looks like 0.9.3 is now on Maven Central: http://search.maven.org/#artifactdetails%7Cnet.java.dev.jets3t%7Cjets3t%7C0.9.3%7Cjar Bump jets3t version from 0.9.2 to 0.9.3 in hadoop-2.3 and hadoop-2.4 profiles - Key: SPARK-5700 URL: https://issues.apache.org/jira/browse/SPARK-5700 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Labels: flaky-test This is a follow-up ticket for SPARK-5671 and SPARK-5696. JetS3t 0.9.2 contains a log4j.properties file inside the artifact and breaks our tests (see SPARK-5696). This is fixed in 0.9.3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
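The fix itself should be a one-line version bump in each of the two profiles. A hedged sketch of the pom change, assuming the profiles pin JetS3t through a {{jets3t.version}} property (the property name is an assumption, not verified against Spark's pom):
{code}
<!-- pom.xml sketch: bump jets3t in the hadoop-2.3 profile (hadoop-2.4 is analogous) -->
<profile>
  <id>hadoop-2.3</id>
  <properties>
    <!-- was 0.9.2; 0.9.3 drops the bundled log4j.properties -->
    <jets3t.version>0.9.3</jets3t.version>
  </properties>
</profile>
{code}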
[jira] [Commented] (SPARK-5704) createDataFrame replace applySchema/inferSchema
[ https://issues.apache.org/jira/browse/SPARK-5704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313763#comment-14313763 ] Apache Spark commented on SPARK-5704: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4498 createDataFrame replace applySchema/inferSchema --- Key: SPARK-5704 URL: https://issues.apache.org/jira/browse/SPARK-5704 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
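For context, the change unifies the old entry points behind one name. A hedged Scala sketch of the before/after, assuming a Spark shell where {{sc}} and {{sqlContext}} are in scope (exact overloads may differ from the final API):
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(StructField("name", StringType, nullable = true)))
val rows = sc.parallelize(Seq(Row("alice"), Row("bob")))

// Old, to-be-replaced name:
// val df = sqlContext.applySchema(rows, schema)

// New unified entry point:
val df = sqlContext.createDataFrame(rows, schema)
{code}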
[jira] [Updated] (SPARK-4655) Split Stage into ShuffleMapStage and ResultStage subclasses
[ https://issues.apache.org/jira/browse/SPARK-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4655: -- Target Version/s: 1.4.0 (was: 1.3.0) Assignee: Ilya Ganelin (was: Josh Rosen) Hi [~ilganeli], feel free to work on this; I've assigned this JIRA to you. Split Stage into ShuffleMapStage and ResultStage subclasses --- Key: SPARK-4655 URL: https://issues.apache.org/jira/browse/SPARK-4655 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Josh Rosen Assignee: Ilya Ganelin The scheduler's {{Stage}} class has many fields which are only applicable to result stages or shuffle map stages. As a result, I think that it makes sense to make {{Stage}} into an abstract base class with two subclasses, {{ResultStage}} and {{ShuffleMapStage}}. This would improve the understandability of the DAGScheduler code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
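A minimal sketch of the proposed shape (the field names here are illustrative, not the DAGScheduler's actual members):
{code}
// Fields common to all stages stay on the abstract base;
// stage-kind-specific state moves down to the subclasses.
abstract class Stage(val id: Int, val numTasks: Int)

class ShuffleMapStage(id: Int, numTasks: Int, val shuffleId: Int)
    extends Stage(id, numTasks) {
  // map-output availability tracking would live only here
}

class ResultStage(id: Int, numTasks: Int)
    extends Stage(id, numTasks) {
  // the active job / result-handler reference would live only here
}
{code}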
[jira] [Commented] (SPARK-5678) DataFrame.to_pandas
[ https://issues.apache.org/jira/browse/SPARK-5678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312568#comment-14312568 ] Apache Spark commented on SPARK-5678: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4476 DataFrame.to_pandas Key: SPARK-5678 URL: https://issues.apache.org/jira/browse/SPARK-5678 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Davies Liu to_pandas to convert a DataFrame into a Pandas DataFrame. Note that the whole DataFrame API should still work even when Pandas is not available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4267. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Sean Owen Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later -- Key: SPARK-4267 URL: https://issues.apache.org/jira/browse/SPARK-4267 Project: Spark Issue Type: Bug Components: YARN Reporter: Tsuyoshi OZAWA Assignee: Sean Owen Priority: Blocker Fix For: 1.3.0 Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0, so I compiled with protobuf 2.5.0 like this:
{code}
./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0
{code}
Then Spark on YARN fails to launch jobs with an NPE:
{code}
$ bin/spark-shell --master yarn-client
scala> sc.textFile("hdfs:///user/ozawa/wordcountInput20G").flatMap(line => line.split(" ")).map(word => (word, 1)).persist().reduceByKey((a, b) => a + b, 16).saveAsTextFile("hdfs:///user/ozawa/sparkWordcountOutNew2");
java.lang.NullPointerException
 at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284)
 at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291)
 at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480)
 at $iwC$$iwC$$iwC$$iwC.<init>(<console>:13)
 at $iwC$$iwC$$iwC.<init>(<console>:18)
 at $iwC$$iwC.<init>(<console>:20)
 at $iwC.<init>(<console>:22)
 at <init>(<console>:24)
 at .<init>(<console>:28)
 at .<clinit>(<console>)
 at .<init>(<console>:7)
 at .<clinit>(<console>)
 at $print(<console>)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
 at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
 at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
{code}
[jira] [Commented] (SPARK-5679) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method
[ https://issues.apache.org/jira/browse/SPARK-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312600#comment-14312600 ] Kostas Sakellis commented on SPARK-5679: I have tried to repro this in a number of different ways and failed: 1. On Mac, with sbt and without sbt. 2. On CentOS 6.x, with sbt and without sbt. 3. On Ubuntu, with sbt and without sbt. Yet it is reproducible on the build machines, but only for hadoop 2.2. As [~joshrosen] pointed out, it might be some shared state that is specific to older versions of hadoop. How do we feel about changing this test suite so that it doesn't use a shared SparkContext? I know it will slow down the suite a bit, but it might be the easiest fix. Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method -- Key: SPARK-5679 URL: https://issues.apache.org/jira/browse/SPARK-5679 Project: Spark Issue Type: Bug Components: Spark Core, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Kostas Sakellis Priority: Blocker Please audit these and see if there are any assumptions with respect to File IO that might not hold in all cases. I'm happy to help if you can't find anything. These both failed in the same run: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-SBT/38/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink {code} org.apache.spark.metrics.InputOutputMetricsSuite.input metrics with mixed read method Failing for the past 13 builds (Since Failed#26 ) Took 48 sec. Error Message 2030 did not equal 6496 Stacktrace sbt.ForkMain$ForkError: 2030 did not equal 6496 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply$mcV$sp(InputOutputMetricsSuite.scala:135) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfter$$super$runTest(InputOutputMetricsSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.metrics.InputOutputMetricsSuite.runTest(InputOutputMetricsSuite.scala:46) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at
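A hedged sketch of the per-test-context change floated in the comment above, in ScalaTest style (the suite's internals are assumed, not its actual code):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterEach, FunSuite}

// Sketch: give each test a fresh SparkContext so no state leaks between tests.
abstract class FreshContextSuite extends FunSuite with BeforeAndAfterEach {
  protected var sc: SparkContext = _

  override def beforeEach(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local").setAppName("test"))
  }

  override def afterEach(): Unit = {
    if (sc != null) { sc.stop(); sc = null }
  }
}
{code}
The trade-off is exactly the one named in the comment: context startup/teardown per test is slower, but it removes the shared state suspected of causing the Hadoop-2.2-only failures.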
[jira] [Updated] (SPARK-5690) Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion
[ https://issues.apache.org/jira/browse/SPARK-5690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5690: - Affects Version/s: 1.3.0 Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion - Key: SPARK-5690 URL: https://issues.apache.org/jira/browse/SPARK-5690 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Andrew Or Labels: flaky-test https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/1647/testReport/junit/org.apache.spark.deploy.rest/StandaloneRestSubmitSuite/simple_submit_until_completion/ {code} org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion Failing for the past 1 build (Since Failed#1647 ) Took 30 sec. Error Message Driver driver-20150209035158- did not finish within 30 seconds. Stacktrace sbt.ForkMain$ForkError: Driver driver-20150209035158- did not finish within 30 seconds. at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$class.fail(Assertions.scala:1328) at org.scalatest.FunSuite.fail(FunSuite.scala:1555) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$apache$spark$deploy$rest$StandaloneRestSubmitSuite$$waitUntilFinished(StandaloneRestSubmitSuite.scala:152) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply$mcV$sp(StandaloneRestSubmitSuite.scala:57) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StandaloneRestSubmitSuite.scala:41) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.runTest(StandaloneRestSubmitSuite.scala:41) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterAll$$super$run(StandaloneRestSubmitSuite.scala:41) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
[jira] [Updated] (SPARK-5690) Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion
[ https://issues.apache.org/jira/browse/SPARK-5690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5690: - Priority: Critical (was: Major) Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion - Key: SPARK-5690 URL: https://issues.apache.org/jira/browse/SPARK-5690 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Andrew Or Priority: Critical Labels: flaky-test https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/1647/testReport/junit/org.apache.spark.deploy.rest/StandaloneRestSubmitSuite/simple_submit_until_completion/ {code} org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion Failing for the past 1 build (Since Failed#1647 ) Took 30 sec. Error Message Driver driver-20150209035158- did not finish within 30 seconds. Stacktrace sbt.ForkMain$ForkError: Driver driver-20150209035158- did not finish within 30 seconds. at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$class.fail(Assertions.scala:1328) at org.scalatest.FunSuite.fail(FunSuite.scala:1555) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$apache$spark$deploy$rest$StandaloneRestSubmitSuite$$waitUntilFinished(StandaloneRestSubmitSuite.scala:152) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply$mcV$sp(StandaloneRestSubmitSuite.scala:57) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StandaloneRestSubmitSuite.scala:41) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.runTest(StandaloneRestSubmitSuite.scala:41) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterAll$$super$run(StandaloneRestSubmitSuite.scala:41) at
[jira] [Created] (SPARK-5690) Flaky test:
Patrick Wendell created SPARK-5690: -- Summary: Flaky test: Key: SPARK-5690 URL: https://issues.apache.org/jira/browse/SPARK-5690 Project: Spark Issue Type: Bug Components: Tests Reporter: Patrick Wendell Assignee: Andrew Or https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/1647/testReport/junit/org.apache.spark.deploy.rest/StandaloneRestSubmitSuite/simple_submit_until_completion/ {code} org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion Failing for the past 1 build (Since Failed#1647 ) Took 30 sec. Error Message Driver driver-20150209035158- did not finish within 30 seconds. Stacktrace sbt.ForkMain$ForkError: Driver driver-20150209035158- did not finish within 30 seconds. at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$class.fail(Assertions.scala:1328) at org.scalatest.FunSuite.fail(FunSuite.scala:1555) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$apache$spark$deploy$rest$StandaloneRestSubmitSuite$$waitUntilFinished(StandaloneRestSubmitSuite.scala:152) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply$mcV$sp(StandaloneRestSubmitSuite.scala:57) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StandaloneRestSubmitSuite.scala:41) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.runTest(StandaloneRestSubmitSuite.scala:41) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterAll$$super$run(StandaloneRestSubmitSuite.scala:41) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.run(StandaloneRestSubmitSuite.scala:41) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) at
[jira] [Updated] (SPARK-5690) Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion
[ https://issues.apache.org/jira/browse/SPARK-5690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5690: --- Summary: Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion (was: Flaky test: ) Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion - Key: SPARK-5690 URL: https://issues.apache.org/jira/browse/SPARK-5690 Project: Spark Issue Type: Bug Components: Tests Reporter: Patrick Wendell Assignee: Andrew Or Labels: flaky-test https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/1647/testReport/junit/org.apache.spark.deploy.rest/StandaloneRestSubmitSuite/simple_submit_until_completion/ {code} org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion Failing for the past 1 build (Since Failed#1647 ) Took 30 sec. Error Message Driver driver-20150209035158- did not finish within 30 seconds. Stacktrace sbt.ForkMain$ForkError: Driver driver-20150209035158- did not finish within 30 seconds. at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$class.fail(Assertions.scala:1328) at org.scalatest.FunSuite.fail(FunSuite.scala:1555) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$apache$spark$deploy$rest$StandaloneRestSubmitSuite$$waitUntilFinished(StandaloneRestSubmitSuite.scala:152) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply$mcV$sp(StandaloneRestSubmitSuite.scala:57) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StandaloneRestSubmitSuite.scala:41) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.runTest(StandaloneRestSubmitSuite.scala:41) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at 
org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterAll$$super$run(StandaloneRestSubmitSuite.scala:41) at
[jira] [Updated] (SPARK-5689) Document what can be run in different YARN modes
[ https://issues.apache.org/jira/browse/SPARK-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5689: --- Issue Type: Documentation (was: Improvement) Document what can be run in different YARN modes Key: SPARK-5689 URL: https://issues.apache.org/jira/browse/SPARK-5689 Project: Spark Issue Type: Documentation Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves We should document what can be run in the different YARN modes. For instance, the interactive shell only works in yarn-client mode; recently, with https://github.com/apache/spark/pull/3976, users gained the ability to run Python scripts in cluster mode; etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1142) Allow adding jars on app submission, outside of code
[ https://issues.apache.org/jira/browse/SPARK-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312608#comment-14312608 ] Patrick Wendell commented on SPARK-1142: This already exists - you can use the --jars flag to spark-submit or set 'spark.jars' manually. Allow adding jars on app submission, outside of code Key: SPARK-1142 URL: https://issues.apache.org/jira/browse/SPARK-1142 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 0.9.0 Reporter: Sandy Pérez González Assignee: Sandy Ryza yarn-standalone mode supports an option that allows adding jars that will be distributed on the cluster with job submission. Providing similar functionality for other app submission modes will allow the spark-app script proposed in SPARK-1126 to support an add-jars option that works for every submit mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
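To make the existing mechanism concrete, a short sketch of both options named in the comment (the paths and class name are placeholders):
{code}
# Option 1: pass extra jars at submission time (comma-separated list).
spark-submit --class com.example.Main \
  --jars /path/to/dep1.jar,/path/to/dep2.jar \
  app.jar

# Option 2: set the equivalent configuration property.
spark-submit --class com.example.Main \
  --conf spark.jars=/path/to/dep1.jar,/path/to/dep2.jar \
  app.jar
{code}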
[jira] [Commented] (SPARK-5685) Show warning when users open text files compressed with non-splittable algorithms like gzip
[ https://issues.apache.org/jira/browse/SPARK-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312622#comment-14312622 ] Josh Rosen commented on SPARK-5685: --- [~nchammas], in general I'm a huge fan of runtime warnings / better exceptions, especially for common issues like this. I wonder if it would be too noisy to log a warning every time textFile was used on compressed input; instead, what do you think about logging only in cases where minPartitions > 1 and the input is compressed? This would cover the case where a user sees that they're not obtaining sufficient parallelism and then tries to increase the parallelism. Also, what happens if the user specifies the path of a directory that contains many input files, some of which are compressed and some of which aren't? Does the driver know whether the files are compressed in an unsplittable way, or do we only discover this on the executors once the job runs? Show warning when users open text files compressed with non-splittable algorithms like gzip --- Key: SPARK-5685 URL: https://issues.apache.org/jira/browse/SPARK-5685 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Nicholas Chammas Priority: Minor This is a usability or user-friendliness issue. It's extremely common for people to load a text file compressed with gzip, process it, and then wonder why only 1 core in their cluster is doing any work. Some examples: * http://stackoverflow.com/q/28127119/877069 * http://stackoverflow.com/q/27531816/877069 I'm not sure how this problem can be generalized, but at the very least it would be helpful if Spark displayed some kind of warning in the common case when someone opens a gzipped file with {{sc.textFile}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
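A rough Scala sketch of the check being discussed, assuming we can consult Hadoop's codec factory on the driver (the helper and its wiring are illustrative, not Spark's actual code):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

// True when the file's codec (if any) cannot be split across tasks.
def isUnsplittable(path: String, conf: Configuration): Boolean = {
  val codec = new CompressionCodecFactory(conf).getCodec(new Path(path))
  codec != null && !codec.isInstanceOf[SplittableCompressionCodec]
}

// Hypothetical call site inside textFile, per the comment above: warn only
// when the user explicitly asked for parallelism they cannot get.
// if (minPartitions > 1 && isUnsplittable(path, hadoopConfiguration))
//   logWarning(s"$path uses a non-splittable codec; it will be read by a single task.")
{code}
On the question of mixed directories: {{CompressionCodecFactory.getCodec}} resolves the codec from the file name alone, so in principle the driver can make this check per file before the job runs.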
[jira] [Commented] (SPARK-5343) ShortestPaths traverses backwards
[ https://issues.apache.org/jira/browse/SPARK-5343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312591#comment-14312591 ] Brennon York commented on SPARK-5343: - I'll take this issue, thanks. ShortestPaths traverses backwards - Key: SPARK-5343 URL: https://issues.apache.org/jira/browse/SPARK-5343 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Reporter: Michael Malak GraphX ShortestPaths seems to be following edges backwards instead of forwards:
{code}
import org.apache.spark.graphx._
val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,""))))

lib.ShortestPaths.run(g, Array(3)).vertices.collect
// res1: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3 -> 0)), (2,Map()))

lib.ShortestPaths.run(g, Array(1)).vertices.collect
// res2: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), (3,Map(1 -> 2)), (2,Map(1 -> 1)))
{code}
The following changes may be what will make it run forward: Change one occurrence of src to dst in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L64 Change three occurrences of dst to src in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L65 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
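Until the traversal direction is fixed, a hedged workaround sketch: run the algorithm on the reversed graph, so that the backwards traversal effectively follows edges forwards (assuming the landmark semantics shown above):
{code}
import org.apache.spark.graphx._

// Reversing every edge makes the backwards traversal behave like a forward one.
val forward = lib.ShortestPaths.run(g.reverse, Array(3)).vertices.collect
// expected: Array((1,Map(3 -> 2)), (2,Map(3 -> 1)), (3,Map(3 -> 0)))
{code}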
[jira] [Commented] (SPARK-3355) Allow running maven tests in run-tests
[ https://issues.apache.org/jira/browse/SPARK-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312588#comment-14312588 ] Brennon York commented on SPARK-3355: - I've started this and should have the fix up shortly. It currently leverages the AMPLAB_JENKINS_BUILD_TOOL environment variable, though, and I wanted to sync up to see whether that still makes sense now that it is duplicated. Allow running maven tests in run-tests -- Key: SPARK-3355 URL: https://issues.apache.org/jira/browse/SPARK-3355 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell We should have a variable called AMPLAB_JENKINS_BUILD_TOOL that decides whether to run sbt or maven. This would allow us to simplify our build matrix in Jenkins... currently the maven builds run a totally different thing than the normal run-tests builds. The maven build currently does something like this:
{code}
mvn -DskipTests -Pprofile1 -Pprofile2 ... clean package
mvn test -Pprofile1 -Pprofile2 ... --fail-at-end
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
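A hedged sketch of the dispatch being described (the variable name comes from the ticket; the surrounding script logic, including {{BUILD_PROFILES}}, is assumed):
{code}
# Inside dev/run-tests (sketch): choose the build tool from the env var.
if [ "$AMPLAB_JENKINS_BUILD_TOOL" == "maven" ]; then
  mvn -DskipTests $BUILD_PROFILES clean package
  mvn test $BUILD_PROFILES --fail-at-end
else
  sbt $BUILD_PROFILES assembly test
fi
{code}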
[jira] [Commented] (SPARK-5691) Preventing duplicate registering of an application has incorrect logic
[ https://issues.apache.org/jira/browse/SPARK-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312632#comment-14312632 ] Matt Cheah commented on SPARK-5691: --- I've determined that this is a pretty simple bug in the Master code. I'm on commit hash 0793ee1b4dea1f4b0df749e8ad7c1ab70b512faf. In Master.scala, in the registerApplication method, it checks if the application is already registered by checking the addressToWorker data structure. In reality, this is wrong - it should examine the addressToApp data structure. I'll submit a pull request shortly. Preventing duplicate registering of an application has incorrect logic -- Key: SPARK-5691 URL: https://issues.apache.org/jira/browse/SPARK-5691 Project: Spark Issue Type: Bug Affects Versions: 1.1.1, 1.2.0 Reporter: Matt Cheah Fix For: 1.3.0 If an application registers twice with the Master, the Master accepts both copies and they both show up in the UI and consume resources. This is incorrect behavior. This happens inadvertently in regular usage when the Master is under high load, but it boils down to: when an application times out registering with the master and sends a second registration message, but the Master is still alive, it processes the first registration message for the application but also erroneously processes the second registration message instead of discarding it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
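A minimal sketch of the fix described in the comment (the map and method names come from the comment; the surrounding Master code is paraphrased, not verbatim):
{code}
// Master.scala (sketch): registerApplication should consult addressToApp,
// not addressToWorker, when deciding whether this app already registered.
def registerApplication(app: ApplicationInfo): Unit = {
  val appAddress = app.driver.path.address
  if (addressToApp.contains(appAddress)) {
    logInfo("Attempted to re-register application at same address: " + appAddress)
    return // drop the duplicate registration instead of accepting a second copy
  }
  // ... proceed with normal registration ...
}
{code}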
[jira] [Assigned] (SPARK-4600) org.apache.spark.graphx.VertexRDD.diff does not work
[ https://issues.apache.org/jira/browse/SPARK-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-4600: Assignee: Brennon York org.apache.spark.graphx.VertexRDD.diff does not work Key: SPARK-4600 URL: https://issues.apache.org/jira/browse/SPARK-4600 Project: Spark Issue Type: Bug Components: GraphX Environment: scala 2.10.4 spark 1.1.0 Reporter: Teppei Tosa Assignee: Brennon York Labels: graphx VertexRDD.diff doesn't work. For example:
{code}
val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 2L).map(id => (id, id.toInt)))
setA.collect.foreach(println(_))
// (0,0)
// (1,1)
val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(1L until 3L).map(id => (id, id.toInt)))
setB.collect.foreach(println(_))
// (1,1)
// (2,2)
val diff = setA.diff(setB)
diff.collect.foreach(println(_))
// printed none
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5691) Preventing duplicate registering of an application has incorrect logic
[ https://issues.apache.org/jira/browse/SPARK-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312644#comment-14312644 ] Apache Spark commented on SPARK-5691: - User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/4477 Preventing duplicate registering of an application has incorrect logic -- Key: SPARK-5691 URL: https://issues.apache.org/jira/browse/SPARK-5691 Project: Spark Issue Type: Bug Affects Versions: 1.1.1, 1.2.0 Reporter: Matt Cheah Fix For: 1.3.0 If an application registers twice with the Master, the Master accepts both copies and they both show up in the UI and consume resources. This is incorrect behavior. This happens inadvertently in regular usage when the Master is under high load, but it boils down to: when an application times out registering with the master and sends a second registration message, but the Master is still alive, it processes the first registration message for the application but also erroneously processes the second registration message instead of discarding it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo
[ https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313254#comment-14313254 ] Nicholas Chammas commented on SPARK-5676: - It ended up in Mesos because [Spark itself came out of Mesos|https://github.com/mesos/spark]. It's just history. The Mesos project does not manage that repo at all. Anyway, we should add a license for clarity's sake, since we accept contributions there, regardless of whether those scripts are being distributed or not. I think we agree on that. If you don't want an issue to track that here, that's fine. No big deal either way, honestly. License missing from spark-ec2 repo --- Key: SPARK-5676 URL: https://issues.apache.org/jira/browse/SPARK-5676 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein There is no LICENSE file or licence headers in the code in the spark-ec2 repo. Also, I believe there is no contributor license agreement notification in place (like there is in the main spark repo). It would be great to fix this (sooner better than later while contributors list is small), so that users wishing to use this part of Spark are not in doubt over licensing issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo
[ https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313260#comment-14313260 ] Sean Owen commented on SPARK-5676: -- Ah, you're saying it isn't even part of Mesos. I see why there's not obviously a good home in JIRA, then. Can that repo just be managed as its own self-contained project, then, using GitHub issues? License missing from spark-ec2 repo --- Key: SPARK-5676 URL: https://issues.apache.org/jira/browse/SPARK-5676 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein There is no LICENSE file or licence headers in the code in the spark-ec2 repo. Also, I believe there is no contributor license agreement notification in place (like there is in the main spark repo). It would be great to fix this (sooner better than later while contributors list is small), so that users wishing to use this part of Spark are not in doubt over licensing issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo
[ https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313269#comment-14313269 ] Shivaram Venkataraman commented on SPARK-5676: -- Yes - it is managed as a self-contained project. However bugs in that project are often experienced by Spark users, so we end up with issues created here. I think filing these issues under the component EC2 is a fine thing to do as it does affect Spark usage on EC2. License missing from spark-ec2 repo --- Key: SPARK-5676 URL: https://issues.apache.org/jira/browse/SPARK-5676 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein There is no LICENSE file or licence headers in the code in the spark-ec2 repo. Also, I believe there is no contributor license agreement notification in place (like there is in the main spark repo). It would be great to fix this (sooner better than later while contributors list is small), so that users wishing to use this part of Spark are not in doubt over licensing issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5648) support alter ... unset tblproperties(key)
[ https://issues.apache.org/jira/browse/SPARK-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5648. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4424 [https://github.com/apache/spark/pull/4424] support alter ... unset tblproperties(key) --- Key: SPARK-5648 URL: https://issues.apache.org/jira/browse/SPARK-5648 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: DoingDone9 Fix For: 1.3.0 Make HiveContext support {{unset tblproperties(key)}}, e.g.:
{code}
alter view viewName unset tblproperties(k);
alter table tableName unset tblproperties(k);
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo
[ https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313271#comment-14313271 ] Nicholas Chammas commented on SPARK-5676: - Yeah, AFAIK it has nothing to do with Mesos apart from that historical connection. I think having it stand alone using GitHub issues is fine, though the question remains of what account it should fall under. Maybe spark-ec2/spark-ec2 (kinda like [boot2docker/boot2docker|https://github.com/boot2docker/boot2docker])? Anyway, this is a separate question from the one raised here. But yeah, it should probably be moved at some point. License missing from spark-ec2 repo --- Key: SPARK-5676 URL: https://issues.apache.org/jira/browse/SPARK-5676 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein There is no LICENSE file or licence headers in the code in the spark-ec2 repo. Also, I believe there is no contributor license agreement notification in place (like there is in the main spark repo). It would be great to fix this (sooner better than later while contributors list is small), so that users wishing to use this part of Spark are not in doubt over licensing issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org