[jira] [Resolved] (SPARK-4791) Create SchemaRDD from case classes with multiple constructors
[ https://issues.apache.org/jira/browse/SPARK-4791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4791. - Resolution: Fixed Assignee: Joseph K. Bradley > Create SchemaRDD from case classes with multiple constructors > - > > Key: SPARK-4791 > URL: https://issues.apache.org/jira/browse/SPARK-4791 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Issue: One can usually take an RDD of case classes and create a SchemaRDD, > where Spark SQL infers the schema from the case class metadata. However, if > the case class has multiple constructors, then ScalaReflection.schemaFor gets > confused. > Motivation: In spark.ml, I would like to create a class with the following > signature: > {code} > case class LabeledPoint(label: Double, features: Vector, weight: Double) { > def this(label: Double, features: Vector) = this(label, features, 1.0) > } > {code} > Proposed fix: Change ScalaReflection.schemaFor so that it checks whether there > are multiple constructors. If there are, it should take the primary > constructor. This will not change the behavior of existing code, since > schemaFor currently only supports case classes with a single constructor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
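As a rough illustration of the proposed fix, the primary constructor can be distinguished from auxiliary {{def this(...)}} constructors using Scala runtime reflection. This is only a hedged sketch, not the actual ScalaReflection patch: the {{Vector}} field is replaced with {{Seq[Double]}} so the snippet is self-contained, the object and method names are invented for the example, and {{termNames.CONSTRUCTOR}} / {{paramLists}} are the Scala 2.11 names (Scala 2.10, which Spark used at the time, spelled them {{nme.CONSTRUCTOR}} and {{paramss}}).

{code:title=sketch: selecting the primary constructor|borderStyle=solid}
import scala.reflect.runtime.universe._

// Self-contained stand-in for the spark.ml class (Vector replaced by Seq[Double]).
case class LabeledPoint(label: Double, features: Seq[Double], weight: Double) {
  def this(label: Double, features: Seq[Double]) = this(label, features, 1.0)
}

object PrimaryConstructorDemo {
  // Return the parameter names of T's primary constructor, ignoring
  // any auxiliary constructors.
  def primaryParams[T: TypeTag]: List[String] = {
    // All constructor symbols, including auxiliary `def this(...)` ones.
    val alternatives =
      typeOf[T].member(termNames.CONSTRUCTOR).asTerm.alternatives
    // isPrimaryConstructor singles out the one from the class header.
    val primary = alternatives
      .map(_.asMethod)
      .find(_.isPrimaryConstructor)
      .getOrElse(sys.error("no primary constructor found"))
    primary.paramLists.flatten.map(_.name.toString)
  }

  def main(args: Array[String]): Unit = {
    // The three-argument primary constructor is chosen even though a
    // two-argument auxiliary constructor exists.
    println(primaryParams[LabeledPoint])  // List(label, features, weight)
  }
}
{code}

Existing behavior is unaffected because a case class with a single constructor has exactly one alternative, which is by definition the primary one.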
[jira] [Commented] (SPARK-4827) Max iterations (100) reached for batch Resolution with deeply nested projects and project *s
[ https://issues.apache.org/jira/browse/SPARK-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242271#comment-14242271 ] Apache Spark commented on SPARK-4827: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/3674 > Max iterations (100) reached for batch Resolution with deeply nested projects > and project *s > > > Key: SPARK-4827 > URL: https://issues.apache.org/jira/browse/SPARK-4827 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust
[jira] [Created] (SPARK-4827) Max iterations (100) reached for batch Resolution with deeply nested projects and project *s
Michael Armbrust created SPARK-4827: --- Summary: Max iterations (100) reached for batch Resolution with deeply nested projects and project *s Key: SPARK-4827 URL: https://issues.apache.org/jira/browse/SPARK-4827 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust
[jira] [Comment Edited] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD
[ https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242177#comment-14242177 ] 宿荣全 edited comment on SPARK-4817 at 12/11/14 7:15 AM: -- [~srowen] One can always call {{foreachRDD}}, operate on the whole RDD, and then call {{take}} on the RDD to get a few elements to print. That achieves the effect, but it is more complicated. {code:title=example NO.1:|borderStyle=solid} val dstream = stream.map->filter->..foreachRDD(rdd => { val array = rdd.collect var result = Array[(String,String)]() result = if (array.size > 5) array.take(5) else array.take(array.size) result foreach println }) {code} {code:title=example NO.2:|borderStyle=solid} val dstream = stream.map->filter->foreachRDD(rdd => { val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, String)]) => iter.toArray) val array = Array.concat(rddarray: _*) var result = Array[(String,String)]() result = if (array.size > 5) array.take(5) else array.take(array.size) result foreach println }) {code} Both samples achieve the effect. From a design perspective, though, having streaming code manipulate the RDD directly is not good design, and the method {{foreachRDD}} is generally not called directly in user code. A streaming action is usually registered through the following 5 methods, all of which call {{foreachRDD}}: # {color:blue}DStream.foreach{color} # {color:blue}DStream.saveAsObjectFiles{color} # {color:blue}DStream.saveAsTextFiles{color} # {color:blue}PairDStreamFunctions.saveAsHadoopFiles{color} # {color:blue}PairDStreamFunctions.saveAsNewAPIHadoopFiles{color} was (Author: surq): [~srowen] Always call {{foreachRDD}}, and operate on all of the RDD, and then call {{take}} on the RDD to get a few elements to print. It can achieve the effect, but it is more complicated. 
{code:title=for example|borderStyle=solid} 1.val dstream = stream.map->filter->..foreachRDD(rdd => { val array = rdd.collect var result = Array[(String,String)]() result = if (array.size > 5) array.take(5) else array.take(array.size) result foreach println }) 2.val dstream = stream.map->filter->foreachRDD(rdd => { val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, String)]) => iter.toArray) val array = Array.concat(rddarray: _*) var result = Array[(String,String)]() result = if (array.size > 5) array.take(5) else array.take(array.size) result foreach println }) {code} This two samples can achieve the effect. From the design perspective streaming direct manipulation of the RDD is not a good design.and I think the method {{foreachRDD}} is generally not used in coding. Generally when streaming register action by through the following 6 methods. Those methods all called method {{foreachRDD}}. {color:blue} # DStream.foreach # DStream.saveAsObjectFiles # DStream.saveAsTextFiles # PairDStreamFunctions.saveAsHadoopFiles # PairDStreamFunctions.saveAsNewAPIHadoopFiles {color} > [streaming]Print the specified number of data and handle all of the elements > in RDD > --- > > Key: SPARK-4817 > URL: https://issues.apache.org/jira/browse/SPARK-4817 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: 宿荣全 >Priority: Minor > > Dstream.print prints 10 elements but handles 11 elements. > A new function based on Dstream.print is proposed: > print the specified number of elements while handling all of the elements in > the RDD. > A typical work scene: > val dstream = stream.map->filter->mapPartitions->print > The data after the filter needs to update a database in mapPartitions, but it > is not necessary to print each element; only the top 20 need to be printed to > view the data processing. 
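A side note on the two samples above: the size check is redundant, because Scala's {{take}} already returns fewer elements when the collection is shorter than the requested count. A minimal hedged sketch of the same pattern (assuming a {{(String, String)}} stream, and assuming the whole RDD must be materialized so that every element is processed, as in example NO.1):

{code:title=sketch: print the first 5 elements after processing all|borderStyle=solid}
dstream.foreachRDD { rdd =>
  // collect() forces every element to be computed (so any upstream side
  // effects, e.g. database updates, run for the whole RDD), then only the
  // first 5 elements are printed on the driver. Array.take(5) is safe
  // even when fewer than 5 elements exist.
  rdd.collect().take(5).foreach(println)
}
{code}

Note that {{rdd.take(5)}} alone would be cheaper but, being lazy, may only evaluate as many partitions as needed, which is presumably why the samples force a full {{collect}} or {{runJob}} first.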
[jira] [Commented] (SPARK-4825) CTAS fails to resolve when created using saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242259#comment-14242259 ] Apache Spark commented on SPARK-4825: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3673 > CTAS fails to resolve when created using saveAsTable > > > Key: SPARK-4825 > URL: https://issues.apache.org/jira/browse/SPARK-4825 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Michael Armbrust >Assignee: Cheng Hao >Priority: Critical > > While writing a test for a different issue, I found that saveAsTable seems to > be broken: > {code} > test("save join to table") { > val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, > i.toString)) > sql("CREATE TABLE test1 (key INT, value STRING)") > testData.insertInto("test1") > sql("CREATE TABLE test2 (key INT, value STRING)") > testData.insertInto("test2") > testData.insertInto("test2") > sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = > b.key").saveAsTable("test") > checkAnswer( > table("test"), > sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = > b.key").collect().toSeq) > } > > sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = > b.key").saveAsTable("test") > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved > plan found, tree: > 'CreateTableAsSelect None, test, false, None > Aggregate [], [COUNT(value#336) AS _c0#334L] > Join Inner, Some((key#335 = key#339)) >MetastoreRelation default, test1, Some(a) >MetastoreRelation default, test2, Some(b) > {code}
[jira] [Commented] (SPARK-4790) Flaky test in ReceivedBlockTrackerSuite: "block addition, block to batch allocation, and cleanup with write ahead log"
[ https://issues.apache.org/jira/browse/SPARK-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242250#comment-14242250 ] Josh Rosen commented on SPARK-4790: --- Actually, scratch that theory: here's a failure in the Master SBT build: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/1206/ > Flaky test in ReceivedBlockTrackerSuite: "block addition, block to batch > allocation, and cleanup with write ahead log" > -- > > Key: SPARK-4790 > URL: https://issues.apache.org/jira/browse/SPARK-4790 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.0 >Reporter: Josh Rosen >Assignee: Tathagata Das > Labels: flaky-test > > Found another flaky streaming test, > "org.apache.spark.streaming.ReceivedBlockTrackerSuite.block addition, block > to batch allocation and cleanup with write ahead log": > {code} > Error Message > File /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist. > Stacktrace > sbt.ForkMain$ForkError: File > /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist. 
> at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397) > at > org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:324) > at > org.apache.spark.streaming.util.WriteAheadLogSuite$.getLogFilesInDirectory(WriteAheadLogSuite.scala:344) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite.getWriteAheadLogFiles(ReceivedBlockTrackerSuite.scala:248) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply$mcV$sp(ReceivedBlockTrackerSuite.scala:173) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite.org$scalatest$BeforeAndAfter$$super$runTest(ReceivedBlockTrackerSuite.scala:41) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite.runTest(ReceivedBlockTrackerSuite.scala:41) > at > 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:318) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite.org$scalatest$BeforeAndAfter$$super$run(ReceivedBlockTrackerSuite.scala:41) > at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) > a
[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"
[ https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242248#comment-14242248 ] Josh Rosen commented on SPARK-4826: --- Actually, it turns out that this has happened twice. Here's a link to another occurrence of the same failure pattern: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1147/ > Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: > "java.lang.IllegalStateException: File exists and there is no append support!" > > > Key: SPARK-4826 > URL: https://issues.apache.org/jira/browse/SPARK-4826 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.2.0, 1.3.0 >Reporter: Josh Rosen >Assignee: Tathagata Das > Labels: flaky-test > > I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite > where four tests failed with the same exception. > [Link to test result (this will eventually > break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/]. > In case that link breaks: > The failed tests: > {code} > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > available only in block manager, not in write ahead log > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > available only in write ahead log, not in block manager > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > available only in write ahead log, and test storing in block manager > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > with partially available in block manager, and rest in write ahead log > {code} > The error messages are all (essentially) the same: > {code} > java.lang.IllegalStateException: File exists and there is no append > support! 
> at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33) > at > org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34) > at > org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34) > at > org.apache.spark.streaming.util.WriteAheadLogWriter.(WriteAheadLogWriter.scala:42) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at 
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:318) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$Su
[jira] [Created] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"
Josh Rosen created SPARK-4826: - Summary: Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!" Key: SPARK-4826 URL: https://issues.apache.org/jira/browse/SPARK-4826 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0, 1.3.0 Reporter: Josh Rosen Assignee: Tathagata Das I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite where four tests failed with the same exception. [Link to test result (this will eventually break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/]. In case that link breaks: The failed tests: {code} org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data available only in block manager, not in write ahead log org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data available only in write ahead log, not in block manager org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data available only in write ahead log, and test storing in block manager org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data with partially available in block manager, and rest in write ahead log {code} The error messages are all (essentially) the same: {code} java.lang.IllegalStateException: File exists and there is no append support! 
at org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33) at org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34) at org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34) at org.apache.spark.streaming.util.WriteAheadLogWriter.(WriteAheadLogWriter.scala:42) at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140) at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95) at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67) at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67) at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at 
org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLi
[jira] [Updated] (SPARK-1600) flaky test case in streaming.CheckpointSuite
[ https://issues.apache.org/jira/browse/SPARK-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-1600: -- Labels: flaky-test (was: ) > flaky test case in streaming.CheckpointSuite > > > Key: SPARK-1600 > URL: https://issues.apache.org/jira/browse/SPARK-1600 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.2.0, 1.3.0 >Reporter: Nan Zhu > Labels: flaky-test > > The case "recovery with file input stream.recovery with file input stream" > sometimes fails when Jenkins is very busy, even on an unrelated change. > I have hit it 3 times and have also seen it in other places; > the latest example is in > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14397/ > where the modification only touches YARN-related files. > I once reported it on the dev mailing list: > http://apache-spark-developers-list.1001551.n3.nabble.com/a-weird-test-case-in-Streaming-td6116.html
[jira] [Updated] (SPARK-4790) Flaky test in ReceivedBlockTrackerSuite: "block addition, block to batch allocation, and cleanup with write ahead log"
[ https://issues.apache.org/jira/browse/SPARK-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4790: -- Labels: flaky-test (was: ) > Flaky test in ReceivedBlockTrackerSuite: "block addition, block to batch > allocation, and cleanup with write ahead log" > -- > > Key: SPARK-4790 > URL: https://issues.apache.org/jira/browse/SPARK-4790 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.0 >Reporter: Josh Rosen >Assignee: Tathagata Das > Labels: flaky-test > > Found another flaky streaming test, > "org.apache.spark.streaming.ReceivedBlockTrackerSuite.block addition, block > to batch allocation and cleanup with write ahead log": > {code} > Error Message > File /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist. > Stacktrace > sbt.ForkMain$ForkError: File > /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist. > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397) > at > org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:324) > at > org.apache.spark.streaming.util.WriteAheadLogSuite$.getLogFilesInDirectory(WriteAheadLogSuite.scala:344) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite.getWriteAheadLogFiles(ReceivedBlockTrackerSuite.scala:248) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply$mcV$sp(ReceivedBlockTrackerSuite.scala:173) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at 
org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite.org$scalatest$BeforeAndAfter$$super$runTest(ReceivedBlockTrackerSuite.scala:41) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite.runTest(ReceivedBlockTrackerSuite.scala:41) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:318) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite.org$scalatest$BeforeAndAfter$$super$run(ReceivedBlockTrackerSuite.scala:41) > at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite.run(ReceivedBlockTrackerSuite.scala:41) > at > org.scalatest.tools.Framework.org$scalatest$tools$Fram
[jira] [Commented] (SPARK-4790) Flaky test in ReceivedBlockTrackerSuite: "block addition, block to batch allocation, and cleanup with write ahead log"
[ https://issues.apache.org/jira/browse/SPARK-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242243#comment-14242243 ] Josh Rosen commented on SPARK-4790: --- Curiously, it looks like this test hasn't failed recently in the pull request builder ([link|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24351/testReport/junit/org.apache.spark.streaming/ReceivedBlockTrackerSuite/block_addition__block_to_batch_allocation_and_cleanup_with_write_ahead_log/history/]). However, it looks like it has failed on-and-off in the Maven builds ([link|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1132/hadoop.version=1.0.4,label=centos/testReport/junit/org.apache.spark.streaming/ReceivedBlockTrackerSuite/block_addition__block_to_batch_allocation_and_cleanup_with_write_ahead_log/history/]). Perhaps the problem could be related to test suite initialization working differently in Maven than SBT? Just a theory; maybe it actually _has_ failed in the PRB and the recent test history is just a lucky streak. > Flaky test in ReceivedBlockTrackerSuite: "block addition, block to batch > allocation, and cleanup with write ahead log" > -- > > Key: SPARK-4790 > URL: https://issues.apache.org/jira/browse/SPARK-4790 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.0 >Reporter: Josh Rosen >Assignee: Tathagata Das > > Found another flaky streaming test, > "org.apache.spark.streaming.ReceivedBlockTrackerSuite.block addition, block > to batch allocation and cleanup with write ahead log": > {code} > Error Message > File /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist. > Stacktrace > sbt.ForkMain$ForkError: File > /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist. 
> at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397) > at > org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:324) > at > org.apache.spark.streaming.util.WriteAheadLogSuite$.getLogFilesInDirectory(WriteAheadLogSuite.scala:344) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite.getWriteAheadLogFiles(ReceivedBlockTrackerSuite.scala:248) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply$mcV$sp(ReceivedBlockTrackerSuite.scala:173) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite.org$scalatest$BeforeAndAfter$$super$runTest(ReceivedBlockTrackerSuite.scala:41) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) > at > org.apache.spark.streaming.ReceivedBlockTrackerSuite.runTest(ReceivedBlockTrackerSuite.scala:41) > at > 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:318) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTe
[jira] [Comment Edited] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD
[ https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242177#comment-14242177 ] 宿荣全 edited comment on SPARK-4817 at 12/11/14 7:03 AM: -- [~srowen] One can always call {{foreachRDD}}, operate on the whole RDD, and then call {{take}} on the RDD to get a few elements to print. That achieves the effect, but it is more complicated.
{code:title=for example|borderStyle=solid}
// 1. Collect the whole RDD to the driver, then print at most five elements.
val dstream = stream.map->filter->..foreachRDD(rdd => {
  val array = rdd.collect()
  val result = array.take(5)  // take(5) is safe even when array.size < 5
  result foreach println
})
// 2. Materialize every partition via runJob, concatenate, then print at most five elements.
val dstream = stream.map->filter->foreachRDD(rdd => {
  val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, String)]) => iter.toArray)
  val array = Array.concat(rddarray: _*)
  val result = array.take(5)
  result foreach println
})
{code}
These two samples achieve the effect. From a design perspective, though, manipulating the RDD directly in streaming code is not good design, and the method {{foreachRDD}} is generally not called directly in user code. Streaming actions are normally registered through the following five methods, all of which call {{foreachRDD}}: {color:blue} # DStream.foreach # DStream.saveAsObjectFiles # DStream.saveAsTextFiles # PairDStreamFunctions.saveAsHadoopFiles # PairDStreamFunctions.saveAsNewAPIHadoopFiles {color}

was (Author: surq): [an earlier wording of the same comment, superseded by the text above]

> [streaming]Print the specified number of data and handle all of the elements > in RDD > --- > > Key: SPARK-4817 > URL: https://issues.apache.org/jira/browse/SPARK-4817 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: 宿荣全 >Priority: Minor > > Dstream.print function:Print 10 elements and handle 11 elements. > A new function based on Dstream.print function is presented: > the new function: > Print the specified number of data and handle all of the elements in RDD. > there is a work scene: > val dstream = stream.map->filter->mapPartitions->print > the data after filter need update database in mapPartitions,but don't need > print each data,only need to print the top 20 for view the data processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
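[Editor's note] Both samples in the comment above materialize the entire RDD on the driver ({{rdd.collect}} or {{runJob}} over every partition) before taking five elements. A lighter sketch of the same idea uses {{RDD.take}} directly, which only fetches as many partitions as needed. This is an illustrative sketch, not code from the issue: it assumes a {{DStream}} named {{stream}}, and {{updateDatabase}} is a hypothetical per-record side effect standing in for the database update mentioned in the issue description.

```scala
// Sketch: process EVERY element of each batch, but print only the first 5.
// rdd.take(5) fetches only as many partitions as needed, so the full RDD
// is never collected to the driver.
stream.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    iter.foreach(record => updateDatabase(record)) // handle all elements
  }
  rdd.take(5).foreach(println)                     // print only the first 5
}
```

Because {{take}} runs as a separate action, the batch is computed twice unless the upstream DStream is cached; calling {{rdd.cache()}} first would avoid that at the cost of memory.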
[jira] [Commented] (SPARK-4700) Add Http support to Spark Thrift server
[ https://issues.apache.org/jira/browse/SPARK-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242218#comment-14242218 ] Judy Nash commented on SPARK-4700: -- pull request created at https://github.com/apache/spark/pull/3672 > Add Http support to Spark Thrift server > --- > > Key: SPARK-4700 > URL: https://issues.apache.org/jira/browse/SPARK-4700 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.3.0, 1.2.1 > Environment: Linux and Windows >Reporter: Judy Nash > Original Estimate: 48h > Remaining Estimate: 48h > > Currently thrift only supports TCP connection. > The JIRA is to add HTTP support to spark thrift server in addition to the TCP > protocol. Both TCP and HTTP are supported by Hive today. HTTP is more secure > and used often in Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4700) Add Http support to Spark Thrift server
[ https://issues.apache.org/jira/browse/SPARK-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242219#comment-14242219 ] Apache Spark commented on SPARK-4700: - User 'judynash' has created a pull request for this issue: https://github.com/apache/spark/pull/3672 > Add Http support to Spark Thrift server > --- > > Key: SPARK-4700 > URL: https://issues.apache.org/jira/browse/SPARK-4700 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.3.0, 1.2.1 > Environment: Linux and Windows >Reporter: Judy Nash > Original Estimate: 48h > Remaining Estimate: 48h > > Currently thrift only supports TCP connection. > The JIRA is to add HTTP support to spark thrift server in addition to the TCP > protocol. Both TCP and HTTP are supported by Hive today. HTTP is more secure > and used often in Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4700) Add Http support to Spark Thrift server
[ https://issues.apache.org/jira/browse/SPARK-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Judy Nash updated SPARK-4700: - Affects Version/s: 1.3.0 > Add Http support to Spark Thrift server > --- > > Key: SPARK-4700 > URL: https://issues.apache.org/jira/browse/SPARK-4700 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.3.0, 1.2.1 > Environment: Linux and Windows >Reporter: Judy Nash > Original Estimate: 48h > Remaining Estimate: 48h > > Currently thrift only supports TCP connection. > The JIRA is to add HTTP support to spark thrift server in addition to the TCP > protocol. Both TCP and HTTP are supported by Hive today. HTTP is more secure > and used often in Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4720) Remainder should also return null if the divider is 0.
[ https://issues.apache.org/jira/browse/SPARK-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4720: Target Version/s: 1.3.0 (was: 1.2.0) > Remainder should also return null if the divider is 0. > -- > > Key: SPARK-4720 > URL: https://issues.apache.org/jira/browse/SPARK-4720 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > This is a follow-up of SPARK-4593. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4742) The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded
[ https://issues.apache.org/jira/browse/SPARK-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4742: Target Version/s: 1.3.0 (was: 1.2.0) > The name of Parquet File generated by AppendingParquetOutputFormat should be > zero padded > > > Key: SPARK-4742 > URL: https://issues.apache.org/jira/browse/SPARK-4742 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 >Reporter: Sasaki Toru >Priority: Minor > > When I use Parquet File as a output file using > ParquetOutputFormat#getDefaultWorkFile, the file name is not zero padded > while RDD#saveAsText does zero padding. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4554) Set fair scheduler pool for JDBC client session in hive 13
[ https://issues.apache.org/jira/browse/SPARK-4554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4554. - Resolution: Duplicate > Set fair scheduler pool for JDBC client session in hive 13 > -- > > Key: SPARK-4554 > URL: https://issues.apache.org/jira/browse/SPARK-4554 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangfei >Priority: Critical > Fix For: 1.2.0 > > > Now hive 13 shim does not support to set fair scheduler pool -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4554) Set fair scheduler pool for JDBC client session in hive 13
[ https://issues.apache.org/jira/browse/SPARK-4554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4554: Priority: Critical (was: Major) > Set fair scheduler pool for JDBC client session in hive 13 > -- > > Key: SPARK-4554 > URL: https://issues.apache.org/jira/browse/SPARK-4554 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangfei >Priority: Critical > Fix For: 1.2.0 > > > Now hive 13 shim does not support to set fair scheduler pool -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet
[ https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3575: Target Version/s: 1.3.0 (was: 1.2.0) > Hive Schema is ignored when using convertMetastoreParquet > - > > Key: SPARK-3575 > URL: https://issues.apache.org/jira/browse/SPARK-3575 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian >Priority: Critical > > This can cause problems when for example one of the columns is defined as > TINYINT. A class cast exception will be thrown since the parquet table scan > produces INTs while the rest of the execution is expecting bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3860) Improve dimension joins
[ https://issues.apache.org/jira/browse/SPARK-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3860: Target Version/s: 1.3.0 (was: 1.2.0) > Improve dimension joins > --- > > Key: SPARK-3860 > URL: https://issues.apache.org/jira/browse/SPARK-3860 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > This is an umbrella ticket for improving performance for joining multiple > dimension tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4699) Make caseSensitive configurable in Analyzer.scala
[ https://issues.apache.org/jira/browse/SPARK-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4699: Target Version/s: 1.3.0 (was: 1.2.0) > Make caseSensitive configurable in Analyzer.scala > - > > Key: SPARK-4699 > URL: https://issues.apache.org/jira/browse/SPARK-4699 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Jacky Li > Fix For: 1.2.0 > > > Currently, case sensitivity is true by default in Analyzer. It should be > configurable by setting SQLConf in the client application -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4811) Custom UDTFs not working in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4811: Fix Version/s: (was: 1.2.0) > Custom UDTFs not working in Spark SQL > - > > Key: SPARK-4811 > URL: https://issues.apache.org/jira/browse/SPARK-4811 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0, 1.1.1 >Reporter: Saurabh Santhosh >Priority: Critical > > I am using the Thrift server interface to Spark SQL and using beeline to > connect to it. > I tried Spark SQL versions 1.1.0 and 1.1.1, and both throw the > following exception when using any custom UDTF. > These are the steps I took: > *Created a UDTF 'com.x.y.xxx'.* > Registered the UDTF using the following query: > *create temporary function xxx as 'com.x.y.xxx'* > The registration went through without any errors. But when I tried executing > the UDTF I got the following error: > *java.lang.ClassNotFoundException: xxx* > Oddly, it is trying to load the function name instead of the > function class. The exception is at *line no: 81 in hiveudfs.scala* > I have been at it for quite a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3262) CREATE VIEW is not supported but the error message is not clear
[ https://issues.apache.org/jira/browse/SPARK-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3262. - Resolution: Duplicate Fix Version/s: 1.2.0 > CREATE VIEW is not supported but the error message is not clear > --- > > Key: SPARK-3262 > URL: https://issues.apache.org/jira/browse/SPARK-3262 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng >Assignee: Michael Armbrust > Fix For: 1.2.0 > > > In my case, it throws a "table not found" error. If this is not supported, we > should throw an "unsupported" error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD
[ https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242193#comment-14242193 ] Yi Tian commented on SPARK-4817: Hi, [~sowen] I think [~surq]'s main points are: * For people who are just starting to work with Spark Streaming (like [~surq]), the {{print}} API does not make it clear that only 10 lines of data will be handled. * In some cases, developers need an API that handles all of the data but prints only the first few lines. > [streaming]Print the specified number of data and handle all of the elements > in RDD > --- > > Key: SPARK-4817 > URL: https://issues.apache.org/jira/browse/SPARK-4817 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: 宿荣全 >Priority: Minor > > Dstream.print function:Print 10 elements and handle 11 elements. > A new function based on Dstream.print function is presented: > the new function: > Print the specified number of data and handle all of the elements in RDD. > there is a work scene: > val dstream = stream.map->filter->mapPartitions->print > the data after filter need update database in mapPartitions,but don't need > print each data,only need to print the top 20 for view the data processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
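[Editor's note] The second point above ("an API to handle all the data and print the first few lines") can be sketched as a small helper built on {{foreachRDD}}. The name {{printAndHandle}} and its signature are illustrative only; no such method exists in the Spark API.

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper: run `handle` over every element of each batch,
// then print at most `num` elements of that batch. Unlike DStream.print,
// no element is skipped by the processing step.
def printAndHandle[T](dstream: DStream[T], num: Int)(handle: T => Unit): Unit =
  dstream.foreachRDD { rdd =>
    rdd.foreach(handle)            // handle ALL of the elements
    rdd.take(num).foreach(println) // print only the specified number
  }
```

Note that `handle` runs on the executors (so it must be serializable), while the `println` calls run on the driver, matching how `DStream.print` behaves.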
[jira] [Created] (SPARK-4825) CTAS fails to resolve when created using saveAsTable
Michael Armbrust created SPARK-4825: --- Summary: CTAS fails to resolve when created using saveAsTable Key: SPARK-4825 URL: https://issues.apache.org/jira/browse/SPARK-4825 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Michael Armbrust Assignee: Cheng Hao Priority: Critical While writing a test for a different issue, I found that saveAsTable seems to be broken:
{code}
test("save join to table") {
  val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString))
  sql("CREATE TABLE test1 (key INT, value STRING)")
  testData.insertInto("test1")
  sql("CREATE TABLE test2 (key INT, value STRING)")
  testData.insertInto("test2")
  testData.insertInto("test2")
  sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").saveAsTable("test")
  checkAnswer(
    table("test"),
    sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq)
}

sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").saveAsTable("test")

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved plan found, tree:
'CreateTableAsSelect None, test, false, None
 Aggregate [], [COUNT(value#336) AS _c0#334L]
  Join Inner, Some((key#335 = key#339))
   MetastoreRelation default, test1, Some(a)
   MetastoreRelation default, test2, Some(b)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD
[ https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242177#comment-14242177 ] 宿荣全 commented on SPARK-4817:
One can always call foreachRDD, operate on all of the RDD there, and then call take on the RDD to get a few elements to print. That achieves the effect, but it is more complicated. For example:
1. val dstream = stream.map(...).filter(...).foreachRDD(rdd => {
     val array = rdd.collect
     val result = array.take(5)
     result foreach println
   })
2. val dstream = stream.map(...).filter(...).foreachRDD(rdd => {
     val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, String)]) => iter.toArray)
     val array = Array.concat(rddarray: _*)
     val result = array.take(5)
     result foreach println
   })
Both samples achieve the effect, but from a design perspective, direct manipulation of the RDD inside streaming code is not good design, and I think 'foreachRDD' is generally not called directly in application code. Streaming actions are generally registered through the following methods, all of which call 'foreachRDD':
1. DStream.foreach
2. DStream.saveAsObjectFiles
3. DStream.saveAsTextFiles
4. PairDStreamFunctions.saveAsHadoopFiles
5. PairDStreamFunctions.saveAsNewAPIHadoopFiles

> [streaming]Print the specified number of data and handle all of the elements in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
> Issue Type: New Feature
> Components: Streaming
> Reporter: 宿荣全
> Priority: Minor
>
> DStream.print prints 10 elements but handles 11 elements.
> A new function based on DStream.print is proposed: print the specified number of elements while handling all of the elements in the RDD.
> Example scenario:
> val dstream = stream.map->filter->mapPartitions->print
> The data that passes the filter needs to update a database in mapPartitions, but there is no need to print every record; printing only the first 20 is enough to monitor the processing.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
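The split the proposal asks for (handle every element, but print only a few) can be sketched in plain Scala, with a `Seq` standing in for the RDD inside `foreachRDD`; the names and the `update` side effect are illustrative assumptions, not Spark API:

```scala
// Plain-Scala model of the pattern discussed above: the side effect (e.g. a
// database update) runs for ALL records, while only the first `num` records
// are collected for printing. Inside a real DStream this shape would live in
// foreachRDD; everything here is an illustrative stand-in.
def processAllPrintSome[T](records: Seq[T], num: Int)(update: T => Unit): Seq[T] = {
  records.foreach(update)          // handle every element
  val toPrint = records.take(num)  // but only print the first `num`
  toPrint.foreach(println)
  toPrint
}
```

With an actual RDD, the `foreach(update)` step would run on the executors while `take(num)` pulls only a handful of elements back to the driver, which is exactly the split the reporter wants `print` to support.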
[jira] [Updated] (SPARK-4700) Add Http support to Spark Thrift server
[ https://issues.apache.org/jira/browse/SPARK-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Judy Nash updated SPARK-4700: - Description: Currently thrift only supports TCP connection. The JIRA is to add HTTP support to spark thrift server in addition to the TCP protocol. Both TCP and HTTP are supported by Hive today. HTTP is more secure and used often in Windows. was: Currently thrift only supports TCP connection. The ask is to add HTTP connection as well. Both TCP and HTTP are supported by Hive today. > Add Http support to Spark Thrift server > --- > > Key: SPARK-4700 > URL: https://issues.apache.org/jira/browse/SPARK-4700 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.2.1 > Environment: Linux and Windows >Reporter: Judy Nash > Original Estimate: 48h > Remaining Estimate: 48h > > Currently thrift only supports TCP connection. > The JIRA is to add HTTP support to spark thrift server in addition to the TCP > protocol. Both TCP and HTTP are supported by Hive today. HTTP is more secure > and used often in Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242101#comment-14242101 ] Reynold Xin commented on SPARK-4740:
I can't really think of a reason why the Netty one would perform worse than the old NIO one (maybe there are reasons - but I can't think of any). Can you take a look at the IO wait time to see if it is high? If not, then it is probably not seek time, right?

> Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
>
> Key: SPARK-4740
> URL: https://issues.apache.org/jira/browse/SPARK-4740
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Affects Versions: 1.2.0
> Reporter: Zhang, Liye
> Assignee: Reynold Xin
> Attachments: (rxin patch better executor)TestRunner sort-by-key - Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, TestRunner sort-by-key - Thread dump for executor 1_files (Nio-48 cores per node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z
>
> When testing current Spark master (1.3.0-snapshot) with spark-perf (sort-by-key, aggregate-by-key, etc.), the Netty based shuffle transferService takes much longer than the NIO based shuffle transferService. The network throughput of Netty is only about half of that of NIO.
> We tested in standalone mode; the data set used for the test is 20 billion records with a total size of about 400GB. The spark-perf test runs on a 4-node cluster with 10G NIC, 48 cpu cores per node, and 64GB memory per executor. The number of reduce tasks is set to 1000.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD
[ https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242090#comment-14242090 ] 宿荣全 commented on SPARK-4817:
[~srowen] 'Neither prints the "top" elements. Did you mean "first"?' Yes: print the first 'num' records. Both print and foreachRDD ultimately call {new ForEachDStream(this, context.sparkContext.clean(foreachFunc)).register()}. The difference is that print's 'foreachFunc' is defined by Streaming, while foreachRDD's 'foreachFunc' is defined by the developer. I think the approach "always call foreachRDD, and operate on all of the RDD, and then call take on the RDD to get a few elements to print" is the same as print: it only prints, and does not handle all of the elements in the RDD. For example:
1. val dstream = stream.map(...).filter(...).foreachRDD(rdd => {
     val result = rdd.take(11)
     result foreach println
   })
2. val dstream = stream.map(...).filter(...).print
Both of these examples handle 11 records; example 1 prints 11 records, and example 2 prints 10 records plus "...". So, to handle all elements in the RDD while printing only 'num' records, I think this patch is very convenient and necessary.

> [streaming]Print the specified number of data and handle all of the elements in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
> Issue Type: New Feature
> Components: Streaming
> Reporter: 宿荣全
> Priority: Minor
>
> DStream.print prints 10 elements but handles 11 elements.
> A new function based on DStream.print is proposed: print the specified number of elements while handling all of the elements in the RDD.
> Example scenario:
> val dstream = stream.map->filter->mapPartitions->print
> The data that passes the filter needs to update a database in mapPartitions, but there is no need to print every record; printing only the first 20 is enough to monitor the processing.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4687) SparkContext#addFile doesn't keep file folder information
[ https://issues.apache.org/jira/browse/SPARK-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242086#comment-14242086 ] Xuefu Zhang commented on SPARK-4687:
I concur with [~sandyr]'s account of the need for addFolder(), as it saves any Spark application from doing the same thing over and over again, possibly in a less performant way. A folder is such a natural, to some extent indispensable, extension of a file in any system that deals with bits at the storage level. I'd also contend that being hard to get right shouldn't prevent us from trying, and perfecting it along the way, if we believe it's functionally the right thing to add.

> SparkContext#addFile doesn't keep file folder information
> -
>
> Key: SPARK-4687
> URL: https://issues.apache.org/jira/browse/SPARK-4687
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.2.0
> Reporter: Jimmy Xiang
>
> Files added with SparkContext#addFile are loaded with Utils#fetchFile before a task starts. However, Utils#fetchFile puts all files under the Spark root on the worker node. We should have an option to keep the folder information.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4818) Join operation should use iterator/lazy evaluation
[ https://issues.apache.org/jira/browse/SPARK-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242064#comment-14242064 ] Apache Spark commented on SPARK-4818: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3671 > Join operation should use iterator/lazy evaluation > -- > > Key: SPARK-4818 > URL: https://issues.apache.org/jira/browse/SPARK-4818 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.1 >Reporter: Johannes Simon >Priority: Minor > > The current implementation of the join operation does not use an iterator > (i.e. lazy evaluation), causing it to explicitly evaluate the co-grouped > values. In big data applications, these value collections can be very large. > This causes the *cartesian product of all co-grouped values* for a specific > key of both RDDs to be kept in memory during the flatMapValues operation, > resulting in an *O(size(pair._1)*size(pair._2))* memory consumption instead > of *O(1)*. Very large value collections will therefore cause "GC overhead > limit exceeded" exceptions and fail the task, or at least slow down execution > dramatically. > {code:title=PairRDDFunctions.scala|borderStyle=solid} > //... > def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = > { > this.cogroup(other, partitioner).flatMapValues( pair => > for (v <- pair._1; w <- pair._2) yield (v, w) > ) > } > //... > {code} > Since cogroup returns an Iterable instance of an Array, the join > implementation could be changed to the following, which uses lazy evaluation > instead, and has almost no memory overhead: > {code:title=PairRDDFunctions.scala|borderStyle=solid} > //... > def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = > { > this.cogroup(other, partitioner).flatMapValues( pair => > for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w) > ) > } > //... 
> {code} > Alternatively, if the current implementation is intentionally not using lazy > evaluation for some reason, there could be a *lazyJoin()* method next to the > original join implementation that utilizes lazy evaluation. This of course > applies to other join operations as well. > Thanks! :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
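The eager/lazy difference described in this issue can be reproduced with plain Scala collections, independent of Spark; this is an illustrative sketch, not the actual PairRDDFunctions code:

```scala
// Cross product of two co-grouped value collections, as in the join above.
// The eager variant materializes the whole O(n*m) product as a collection;
// the lazy variant yields pairs one at a time from iterators, keeping memory
// overhead constant while the result is consumed.
def eagerProduct[A, B](vs: Iterable[A], ws: Iterable[B]): Iterable[(A, B)] =
  for (v <- vs; w <- ws) yield (v, w)

def lazyProduct[A, B](vs: Iterable[A], ws: Iterable[B]): Iterator[(A, B)] =
  for (v <- vs.iterator; w <- ws.iterator) yield (v, w)
```

Note that in the lazy variant, `ws.iterator` is re-evaluated for each outer element `v`, so a fresh iterator is traversed per pass — the same trick the proposed fix applies inside `flatMapValues`.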
[jira] [Closed] (SPARK-4824) Join should use `Iterator` rather than `Iterable`
[ https://issues.apache.org/jira/browse/SPARK-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu closed SPARK-4824.
--- Resolution: Duplicate

> Join should use `Iterator` rather than `Iterable`
> -
>
> Key: SPARK-4824
> URL: https://issues.apache.org/jira/browse/SPARK-4824
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Shixiong Zhu
>
> In Scala, `map` and `flatMap` on an `Iterable` copy the contents of the `Iterable` to a new `Seq`. For example,
> {code}
> val iterable = Seq(1, 2, 3).map(v => {
>   println(v)
>   v
> })
> println("Iterable map done")
> val iterator = Seq(1, 2, 3).iterator.map(v => {
>   println(v)
>   v
> })
> println("Iterator map done")
> {code}
> outputs
> {code}
> 1
> 2
> 3
> Iterable map done
> Iterator map done
> {code}
> So we should use 'iterator' to reduce the memory consumed by join.
> Found by [~johannes.simon]
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4824) Join should use `Iterator` rather than `Iterable`
[ https://issues.apache.org/jira/browse/SPARK-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242056#comment-14242056 ] Apache Spark commented on SPARK-4824:
- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3671

> Join should use `Iterator` rather than `Iterable`
> -
>
> Key: SPARK-4824
> URL: https://issues.apache.org/jira/browse/SPARK-4824
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Shixiong Zhu
>
> In Scala, `map` and `flatMap` on an `Iterable` copy the contents of the `Iterable` to a new `Seq`. For example,
> {code}
> val iterable = Seq(1, 2, 3).map(v => {
>   println(v)
>   v
> })
> println("Iterable map done")
> val iterator = Seq(1, 2, 3).iterator.map(v => {
>   println(v)
>   v
> })
> println("Iterator map done")
> {code}
> outputs
> {code}
> 1
> 2
> 3
> Iterable map done
> Iterator map done
> {code}
> So we should use 'iterator' to reduce the memory consumed by join.
> Found by [~johannes.simon]
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4687) SparkContext#addFile doesn't keep file folder information
[ https://issues.apache.org/jira/browse/SPARK-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242055#comment-14242055 ] Sandy Ryza commented on SPARK-4687:
--- I think [~xuefuz] can probably motivate this better, but from what I understand, the main use case is Hive's map joins and map bucket joins, in which a smaller table needs to be distributed to every node. The smaller table typically resides in HDFS and is the output of a separate job. For map joins, the smaller table is composed of a bunch of files in a single folder. For map bucket joins, the smaller table is composed of a single folder with a bunch of bucket folders underneath, each containing a set of data files.
At the very least, doing the prefixing would require a bunch of extra FS operations to rename all the subfiles. That might also make them difficult to read from other Hive implementations.
Another, totally separate situation I encountered a while ago where this kind of thing would have been useful was calling http://ctakes.apache.org/ in a distributed fashion. Calling into it requires letting it load a bunch of files from a particular directory structure. We ultimately had to go with a workaround that required installing the directory on every node.
Beyond the issues I outlined in my patch, are there particular edge cases you're worried about where we wouldn't be able to copy the behavior from addFile?

> SparkContext#addFile doesn't keep file folder information
> -
>
> Key: SPARK-4687
> URL: https://issues.apache.org/jira/browse/SPARK-4687
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.2.0
> Reporter: Jimmy Xiang
>
> Files added with SparkContext#addFile are loaded with Utils#fetchFile before a task starts. However, Utils#fetchFile puts all files under the Spark root on the worker node. We should have an option to keep the folder information.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4824) Join should use `Iterator` rather than `Iterable`
Shixiong Zhu created SPARK-4824:
--- Summary: Join should use `Iterator` rather than `Iterable`
Key: SPARK-4824
URL: https://issues.apache.org/jira/browse/SPARK-4824
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: Shixiong Zhu

In Scala, `map` and `flatMap` on an `Iterable` copy the contents of the `Iterable` to a new `Seq`. For example,
{code}
val iterable = Seq(1, 2, 3).map(v => {
  println(v)
  v
})
println("Iterable map done")
val iterator = Seq(1, 2, 3).iterator.map(v => {
  println(v)
  v
})
println("Iterator map done")
{code}
outputs
{code}
1
2
3
Iterable map done
Iterator map done
{code}
So we should use 'iterator' to reduce the memory consumed by join.
Found by [~johannes.simon]
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
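The strict-vs-lazy behaviour in the description can be checked directly by counting how often the mapped function actually runs; a minimal sketch:

```scala
// Seq.map (an Iterable) is strict: the function runs for every element
// immediately. iterator.map is lazy: nothing runs until elements are pulled.
var strictCalls = 0
Seq(1, 2, 3).map { v => strictCalls += 1; v }   // strictCalls becomes 3 here

var lazyCalls = 0
val mappedIter = Seq(1, 2, 3).iterator.map { v => lazyCalls += 1; v }
val lazyCallsBeforeConsume = lazyCalls  // still 0: nothing evaluated yet
val materialized = mappedIter.toList    // consuming the iterator runs the function
```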
[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242034#comment-14242034 ] Debasish Das commented on SPARK-4675:
- [~josephkb] how do we validate that the low-dimensional space gives more meaningful similarities than the feature space (which is sparse)?

> Find similar products and similar users in MatrixFactorizationModel
> ---
>
> Key: SPARK-4675
> URL: https://issues.apache.org/jira/browse/SPARK-4675
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Steven Bourke
> Priority: Trivial
> Labels: mllib, recommender
>
> Using the latent feature space that is learnt in MatrixFactorizationModel, I have added 2 new functions to find similar products and similar users. A user of the API can for example pass a product ID, and get the closest products.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
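Similarity in the learnt latent space, as asked about above, reduces to comparing factor vectors. A hedged sketch using a plain `Map` of hypothetical product factors; the function names and data layout are illustrative assumptions, not the MatrixFactorizationModel API:

```scala
// Hypothetical sketch: rank products by cosine similarity of their latent
// factor vectors, as learnt by matrix factorization. `factors` stands in
// for the learnt product-factor matrix; all names are illustrative.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

def similarProducts(target: Int,
                    factors: Map[Int, Array[Double]],
                    k: Int): Seq[(Int, Double)] =
  factors.toSeq
    .collect { case (id, f) if id != target => (id, cosine(factors(target), f)) }
    .sortBy(p => -p._2)
    .take(k)
```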
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242031#comment-14242031 ] Debasish Das commented on SPARK-4823:
- I am considering coming up with a baseline version that's very close to brute force, but where we cut the computation down with a topK number: for each user, come up with the topK most similar users, where K is defined by the client. This will take care of the matrix factorization use case. Basically, on the master we collect a set of user factors, broadcast it to every node, and do a reduceByKey to generate the topK users for each user from this user block. We pass a kernel function (cosine / polynomial / rbf) into this calculation.
But this idea does not work for raw features, right? If we map the features to a lower dimension using factorization, then this approach should run fine, but I am not sure we can ask users to map their data into a lower dimension. Is it possible to bring in ideas from fastfood and kitchen sinks to do this?

> rowSimilarities
> ---
>
> Key: SPARK-4823
> URL: https://issues.apache.org/jira/browse/SPARK-4823
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Reza Zadeh
>
> RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows.
> This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), brute force is unlikely to be feasible, since the output will be of order 10^12.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
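The brute-force baseline with a client-defined topK and a pluggable kernel, as sketched in the comment above, can be written in plain Scala. This is a single-machine stand-in for the distributed broadcast-plus-reduceByKey version; all names are illustrative assumptions:

```scala
// Brute-force row similarities, cut down to the top-K per row, with the
// kernel (e.g. cosine / polynomial / rbf) passed in as a function. A
// single-machine stand-in for the distributed baseline described above.
type Vec = Array[Double]

def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum

def topKSimilar(rows: Vector[Vec],
                k: Int,
                kernel: (Vec, Vec) => Double = dot): Map[Int, Seq[(Int, Double)]] =
  rows.indices.map { i =>
    // score row i against every other row, then keep only the top K
    val scored = rows.indices.filter(_ != i).map(j => (j, kernel(rows(i), rows(j))))
    i -> scored.sortBy(p => -p._2).take(k).toSeq
  }.toMap
```

Swapping `dot` for an rbf or polynomial kernel only changes the `kernel` argument, which is the flexibility the comment asks for; the O(n^2) scoring loop is what an LSH- or sampling-based method would need to avoid.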
[jira] [Commented] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger
[ https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242030#comment-14242030 ] Michael Armbrust commented on SPARK-4814: - Hive throws a lot of warning here for some reason even when reading data from their own test files. One option here would be to just override the configuration only for the hive subproject. {code} // Hive throws assertion errors for tests that pass. javaOptions in Test := (javaOptions in Test).value.filterNot(_ == "-ea"), {code} > Enable assertions in SBT, Maven tests / AssertionError from Hive's > LazyBinaryInteger > > > Key: SPARK-4814 > URL: https://issues.apache.org/jira/browse/SPARK-4814 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.1.0 >Reporter: Sean Owen > > Follow up to SPARK-4159, wherein we noticed that Java tests weren't running > in Maven, in part because a Java test actually fails with {{AssertionError}}. > That code/test was fixed in SPARK-4850. > The reason it wasn't caught by SBT tests was that they don't run with > assertions on, and Maven's surefire does. 
> Turning on assertions in the SBT build is trivial, adding one line: > {code} > javaOptions in Test += "-ea", > {code} > This reveals a test failure in Scala test suites though: > {code} > [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds) > [info] Failed to execute query using catalyst: > [info] Error: Job aborted due to stage failure: Task 1 in stage 551.0 > failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, > localhost): java.lang.AssertionError > [info]at > org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51) > [info]at > org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110) > [info]at > org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171) > [info]at > org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166) > [info]at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318) > [info]at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314) > [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > [info]at > org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132) > [info]at > org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128) > [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615) > [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615) > [info]at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > [info]at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) > [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) > [info]at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > [info]at > 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) > [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) > [info]at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > [info]at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > [info]at org.apache.spark.scheduler.Task.run(Task.scala:56) > [info]at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) > [info]at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > [info]at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > [info]at java.lang.Thread.run(Thread.java:745) > {code} > The items for this JIRA are therefore: > - Enable assertions in SBT > - Fix this failure > - Figure out why Maven scalatest didn't trigger it - may need assertions > explicitly turned on too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4823) rowSimilarities
Reza Zadeh created SPARK-4823:
- Summary: rowSimilarities
Key: SPARK-4823
URL: https://issues.apache.org/jira/browse/SPARK-4823
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Reza Zadeh

RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows.
This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), brute force is unlikely to be feasible, since the output will be of order 10^12.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger
[ https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241983#comment-14241983 ] Cheng Lian commented on SPARK-4814: --- Thanks [~srowen]! I'll take a look. > Enable assertions in SBT, Maven tests / AssertionError from Hive's > LazyBinaryInteger > > > Key: SPARK-4814 > URL: https://issues.apache.org/jira/browse/SPARK-4814 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.1.0 >Reporter: Sean Owen > > Follow up to SPARK-4159, wherein we noticed that Java tests weren't running > in Maven, in part because a Java test actually fails with {{AssertionError}}. > That code/test was fixed in SPARK-4850. > The reason it wasn't caught by SBT tests was that they don't run with > assertions on, and Maven's surefire does. > Turning on assertions in the SBT build is trivial, adding one line: > {code} > javaOptions in Test += "-ea", > {code} > This reveals a test failure in Scala test suites though: > {code} > [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds) > [info] Failed to execute query using catalyst: > [info] Error: Job aborted due to stage failure: Task 1 in stage 551.0 > failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, > localhost): java.lang.AssertionError > [info]at > org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51) > [info]at > org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110) > [info]at > org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171) > [info]at > org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166) > [info]at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318) > [info]at > 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314) > [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > [info]at > org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132) > [info]at > org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128) > [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615) > [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615) > [info]at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > [info]at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) > [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) > [info]at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > [info]at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) > [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) > [info]at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > [info]at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > [info]at org.apache.spark.scheduler.Task.run(Task.scala:56) > [info]at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) > [info]at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > [info]at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > [info]at java.lang.Thread.run(Thread.java:745) > {code} > The items for this JIRA are therefore: > - Enable assertions in SBT > - Fix this failure > - Figure out why Maven scalatest didn't trigger it - may need assertions > explicitly turned on too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241973#comment-14241973 ] Saisai Shao edited comment on SPARK-4740 at 12/11/14 1:34 AM:
--
Hi Reynold, the code I pasted is just an example; I do convert the ByteBuffer to a Netty ByteBuf. We will test again using your patch to see the difference. Since the performance of Netty was similar to NIO when we tested with a ram disk to minimize the disk effect, can we say that the current implementation of the Netty transfer service is not well tuned for slow random-IO devices like spinning disks?
Also, from my understanding, I guess the unbalanced situation might be introduced by random disk IO latency: SSDs and ram disks far outperform spinning disks on random seeks, which minimizes the variance in IO latency, and the only difference in system configuration before and after was the spinning disk versus the ram disk. What is your opinion?
> Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey > > > Key: SPARK-4740 > URL: https://issues.apache.org/jira/browse/SPARK-4740 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 1.2.0 >Reporter: Zhang, Liye >Assignee: Reynold Xin > Attachments: (rxin patch better executor)TestRunner sort-by-key - > Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner > sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report > 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner > sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, > TestRunner sort-by-key - Thread dump for executor 1_files (Nio-48 cores per > node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z > > > When testing current spark master (1.3.0-snapshot) with spark-perf > (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService > takes much longer time than NIO based shuffle transferService. The network > throughput of Netty is only about half of that of NIO. > We tested with standalone mode, and the data set we used for test is 20 > billion records, and the total size is about 400GB. Spark-perf test is > Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each > executor memory is 64GB. The reduce tasks number is set to 1000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4687) SparkContext#addFile doesn't keep file folder information
[ https://issues.apache.org/jira/browse/SPARK-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241974#comment-14241974 ] Patrick Wendell commented on SPARK-4687: I commented a bit on the JIRA after seeing this. Instead of this, can Hive just add individual files with a prefix denoting what the folder is? Maintaining addFile has been nontrivial (we found a lot of corner cases over time), so it would be good to understand what the alternative is for Hive. > SparkContext#addFile doesn't keep file folder information > - > > Key: SPARK-4687 > URL: https://issues.apache.org/jira/browse/SPARK-4687 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.0 >Reporter: Jimmy Xiang > > Files added with SparkContext#addFile are loaded with Utils#fetchFile before > a task starts. However, Utils#fetchFile puts all files under the Spark root > on the worker node. We should have an option to keep the folder information.
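Patrick's alternative (encoding the folder in each file's name rather than teaching addFile about directories) can be sketched as follows. This is a hypothetical helper for illustration only, not Spark or Hive code; the "__" separator is an assumption:

```python
import os
import tempfile

def prefixed_names(root):
    """Flatten a directory tree into prefixed file names.

    A hypothetical sketch of the suggested alternative: instead of shipping
    a folder, ship each file under a name that encodes its relative path
    (here with "__" as the separator), so the receiver can reconstruct the
    layout without addFile itself tracking directories.
    """
    for dirpath, _, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            yield rel.replace(os.sep, "__"), full

# Demo on a throwaway directory tree.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "conf"))
open(os.path.join(root, "conf", "hive-site.xml"), "w").close()
print(dict(prefixed_names(root)))
```

The tradeoff is that callers must agree on the encoding, which is exactly the kind of convention the comment asks Hive to spell out.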
[jira] [Commented] (SPARK-4687) SparkContext#addFile doesn't keep file folder information
[ https://issues.apache.org/jira/browse/SPARK-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241961#comment-14241961 ] Apache Spark commented on SPARK-4687: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/3670 > SparkContext#addFile doesn't keep file folder information > - > > Key: SPARK-4687 > URL: https://issues.apache.org/jira/browse/SPARK-4687 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.0 >Reporter: Jimmy Xiang > > Files added with SparkContext#addFile are loaded with Utils#fetchFile before > a task starts. However, Utils#fetchFile puts all files under the Spark root > on the worker node. We should have an option to keep the folder information.
[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241952#comment-14241952 ] Joseph K. Bradley commented on SPARK-4675: -- Just to make sure I get your last question, are you asking, "Why compute product similarities using the low-dimensional space when we could do it in the high-dimensional space?" If so, then my understanding is that the low-dimensional space will give more meaningful similarities in general. > Find similar products and similar users in MatrixFactorizationModel > --- > > Key: SPARK-4675 > URL: https://issues.apache.org/jira/browse/SPARK-4675 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Steven Bourke >Priority: Trivial > Labels: mllib, recommender > > Using the latent feature space that is learnt in MatrixFactorizationModel, I > have added 2 new functions to find similar products and similar users. A user > of the API can for example pass a product ID, and get the closest products.
[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241951#comment-14241951 ] Ilayaperumal Gopinathan commented on SPARK-2892: To add more info: when the ReceiverTracker sends the "StopReceiver" message to the receiver actor at the executor, its ReceiverLauncher thread always times out, and I notice the corresponding job is cancelled only because the DAGScheduler is being stopped. This throws the exception [1], while on the executor side the worker node logs the info [2].
Exception [1]:
{code}
INFO sparkDriver-akka.actor.default-dispatcher-14 cluster.SparkDeploySchedulerBackend - Asking each executor to shut down
15:06:53,783 1.1.0.SNAP INFO Thread-40 scheduler.DAGScheduler - Job 1 failed: start at SparkDriver.java:109, took 72.739141 s
Exception in thread "Thread-40" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
	at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:701)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1428)
	at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundPostStop(DAGScheduler.scala:1375)
	at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
	at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
	at akka.actor.ActorCell.terminate(ActorCell.scala:369)
	at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
	at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
	at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}
info [2]:
{code}
INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40127.0.0.1%3A52219-2#1619424601] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/12/10 15:06:53 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@localhost:51262] <- [akka.tcp://sparkExecutor@localhost:52217]: Error [Shut down address: akka.tcp://sparkExecutor@localhost:52217] [
akka.remote.ShutDownAssociation: Shut down address: akka.tcp://sparkExecutor@localhost:52217
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
{code}
Please note that I am running everything on localhost; the same thing works fine in "local" mode, and the above issue only arises in "cluster" mode. I tried changing the hostname to 127.0.0.1 but observed the same behavior. Any clues on what might be going on here would help a lot. Thanks!
> Socket Receiver does not stop when streaming context is stopped > --- > > Key: SPARK-2892 > URL: https://issues.apache.org/jira/browse/SPARK-2892 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.2 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Running NetworkWordCount with > {quote} > ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); > Thread.sleep(6) > {quote} > gives the following error > {quote} > 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) > in 10047 ms on localhost (1/1) > 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at > ReceiverTracker.scala:275) finished in 10.056 s > 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at > ReceiverTracker.scala:275, took 1
[jira] [Created] (SPARK-4822) Use sphinx tags for Python doc annotations
Joseph K. Bradley created SPARK-4822: Summary: Use sphinx tags for Python doc annotations Key: SPARK-4822 URL: https://issues.apache.org/jira/browse/SPARK-4822 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor Currently, pyspark documentation uses the same annotations as in Scala: {code} :: Experimental :: {code} It should use Sphinx annotations: {code} .. note:: Experimental {code} The latter method creates a box. The former method must either be on the same line as the rest of the doc text, or it generates compilation warnings. Proposal: Change all annotations in Python docs to use "note"
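The rendering difference is easy to see in a minimal docstring. `sample_stats` below is a hypothetical function used only for illustration, not part of pyspark:

```python
def sample_stats(rdd):
    """Compute summary statistics for an RDD.

    .. note:: Experimental

    Sphinx renders the ``.. note::`` directive above as a highlighted box.
    The older ``:: Experimental ::`` form is not a directive, so it is
    treated as plain text (or triggers build warnings, as the issue notes).
    """
    raise NotImplementedError

# The directive is ordinary text in __doc__; Sphinx interprets it at doc-build time.
print(".. note:: Experimental" in sample_stats.__doc__)
```

Nothing changes at runtime; only the Sphinx-generated HTML differs between the two annotation styles.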
[jira] [Resolved] (SPARK-3526) Docs section on data locality
[ https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3526. Resolution: Fixed Fix Version/s: 1.2.0 Thanks [~aash] for contributing. > Docs section on data locality > - > > Key: SPARK-3526 > URL: https://issues.apache.org/jira/browse/SPARK-3526 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.0.2 >Reporter: Andrew Ash >Assignee: Andrew Ash > Fix For: 1.2.0 > > > Several threads on the mailing list have been about data locality and how to > interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc. Let's get some more > details in the docs on this concept so we can point future questions there. > A couple people appreciated the below description of locality so it could be > a good starting point: > {quote} > The locality is how close the data is to the code that's processing it. > PROCESS_LOCAL means data is in the same JVM as the code that's running, so > it's really fast. NODE_LOCAL might mean that the data is in HDFS on the same > node, or in another executor on the same node, so is a little slower because > the data has to travel across an IPC connection. RACK_LOCAL is even slower > -- data is on a different server so needs to be sent over the network. > Spark switches to lower locality levels when there's no unprocessed data on a > node that has idle CPUs. In that situation you have two options: wait until > the busy CPUs free up so you can start another task that uses data on that > server, or start a new task on a farther away server that needs to bring data > from that remote place. What Spark typically does is wait a bit in the hopes > that a busy CPU frees up. Once that timeout expires, it starts moving the > data from far away to the free CPU. 
> {quote}
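For reference, the wait-then-fall-back behavior described above is controlled by Spark's locality-wait settings. A sketch of the relevant `spark-defaults.conf` entries follows; the 3s values shown are the documented defaults of this era, so double-check them against your Spark version:

```properties
# How long the scheduler waits for a free slot at a better locality level
# before falling back (PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY).
spark.locality.wait          3s
# Optional per-level overrides; each defaults to spark.locality.wait.
spark.locality.wait.process  3s
spark.locality.wait.node     3s
spark.locality.wait.rack     3s
```

Raising these values favors waiting for local data; setting them to 0 makes Spark start far tasks immediately.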
[jira] [Closed] (SPARK-4633) Support gzip in spark.compression.io.codec
[ https://issues.apache.org/jira/browse/SPARK-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell closed SPARK-4633. -- Resolution: Won't Fix I'd like to close this issue for now until we get a better understanding of the needs. I'm not against adding more codecs; it would just be good to have some profiled reason why we are doing it since, as I said, it has support overhead. We can re-open this if there is more interest. > Support gzip in spark.compression.io.codec > -- > > Key: SPARK-4633 > URL: https://issues.apache.org/jira/browse/SPARK-4633 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Reporter: Takeshi Yamamuro >Priority: Trivial > > gzip is widely used in other frameworks such as hadoop mapreduce and tez, and > also > I think that gzip is more stable than other codecs in terms of both > performance > and space overheads. > I have one open question; current spark configurations have a block size option > for each codec (spark.io.compression.[gzip|lz4|snappy].block.size). > As # of codecs increases, the configurations have more options and > I think that it is sort of complicated for non-expert users. > To mitigate it, my thought is as follows: > the three configurations are replaced with a single option for block size > (spark.io.compression.block.size). Then, 'Meaning' in configurations > will describe "This option makes an effect on gzip, lz4, and snappy. > Block size (in bytes) used in compression, in the case when these compression > codecs are used. Lowering...".
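The performance/space tradeoff being debated can be probed outside Spark with Python's standard library. This sketch only illustrates general gzip codec behavior (compression level, not Spark's block size option or its codec wrappers):

```python
import gzip
import time

# Compressible sample data: repeated text, roughly 1 MB.
data = b"the quick brown fox jumps over the lazy dog\n" * 25000

for level in (1, 6, 9):  # fast ... default ... smallest
    start = time.perf_counter()
    compressed = gzip.compress(data, compresslevel=level)
    elapsed = time.perf_counter() - start
    print(f"level={level}: {len(data)} -> {len(compressed)} bytes "
          f"({elapsed * 1000:.1f} ms)")

# Round-trip sanity check: decompression must recover the input exactly.
assert gzip.decompress(gzip.compress(data)) == data
```

Profiling of this kind on representative shuffle data, rather than toy input, is presumably what the closing comment asks for before adding the codec.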
[jira] [Commented] (SPARK-4821) pyspark.mllib.rand docs not generated correctly
[ https://issues.apache.org/jira/browse/SPARK-4821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241871#comment-14241871 ] Apache Spark commented on SPARK-4821: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/3669 > pyspark.mllib.rand docs not generated correctly > --- > > Key: SPARK-4821 > URL: https://issues.apache.org/jira/browse/SPARK-4821 > Project: Spark > Issue Type: Bug > Components: Documentation, MLlib, PySpark >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > spark/python/docs/pyspark.mllib.rst needs to be updated to reflect the change > in package names from pyspark.mllib.random to .rand > Otherwise, the Python API docs are empty.
[jira] [Updated] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4759: - Fix Version/s: 1.1.2 1.3.0 > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Fix For: 1.3.0, 1.1.2 > > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data and merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, having scheduled only a subset of > the necessary tasks for the final stage. > I've been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be.
[jira] [Updated] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4759: - Labels: backport-needed (was: )
[jira] [Updated] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4759: - Target Version/s: 1.3.0, 1.1.2, 1.2.1
[jira] [Created] (SPARK-4821) pyspark.mllib.rand docs not generated correctly
Joseph K. Bradley created SPARK-4821: Summary: pyspark.mllib.rand docs not generated correctly Key: SPARK-4821 URL: https://issues.apache.org/jira/browse/SPARK-4821 Project: Spark Issue Type: Bug Components: Documentation, MLlib, PySpark Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley spark/python/docs/pyspark.mllib.rst needs to be updated to reflect the change in package names from pyspark.mllib.random to .rand Otherwise, the Python API docs are empty.
[jira] [Updated] (SPARK-4569) Rename "externalSorting" in Aggregator
[ https://issues.apache.org/jira/browse/SPARK-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4569: - Labels: backport-needed (was: ) > Rename "externalSorting" in Aggregator > -- > > Key: SPARK-4569 > URL: https://issues.apache.org/jira/browse/SPARK-4569 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.2.0 >Reporter: Sandy Ryza >Priority: Trivial > Labels: backport-needed > Fix For: 1.3.0 > > > While technically all spilling in Spark does result in sorting, calling this > variable externalSorting makes it seem like ExternalSorter will be used, when > in fact it just means whether spilling is enabled.
[jira] [Updated] (SPARK-4569) Rename "externalSorting" in Aggregator
[ https://issues.apache.org/jira/browse/SPARK-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4569: - Target Version/s: 1.3.0, 1.1.2, 1.2.1 Fix Version/s: 1.3.0
[jira] [Updated] (SPARK-4820) Spark build encounters "File name too long" on some encrypted filesystems
[ https://issues.apache.org/jira/browse/SPARK-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4820: --- Description: This was reported by Luchesar Cekov on github along with a proposed fix. The fix has some potential downstream issues (it will modify the classnames) so until we understand better how many users are affected we aren't going to merge it. However, I'd like to include the issue and workaround here. If you encounter this issue please comment on the JIRA so we can assess the frequency. The issue produces this error: {code} [error] == Expanded type of tree == [error] [error] ConstantType(value = Constant(Throwable)) [error] [error] uncaught exception during compilation: java.io.IOException [error] File name too long [error] two errors found {code} The workaround is in maven under the compile options add: {code} + -Xmax-classfile-name + 128 {code} In SBT add: {code} +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"), {code} > Spark build encounters "File name too long" on some encrypted filesystems > - > > Key: SPARK-4820 > URL: https://issues.apache.org/jira/browse/SPARK-4820 > Project: Spark > Issue Type: Bug >Reporter: Patrick Wendell > > This was reported by Luchesar Cekov on github along with a proposed fix. The > fix has some potential downstream issues (it will modify the classnames) so > until we understand better how many users are affected we aren't going to > merge it. However, I'd like to include the issue and workaround here. If you > encounter this issue please comment on the JIRA so we can assess the > frequency. > The issue produces this error: > {code} > [error] == Expanded type of tree == > [error] > [error] ConstantType(value = Constant(Throwable)) > [error] > [error] uncaught exception during compilation: java.io.IOException > [error] File name too long > [error] two errors found > {code} > The workaround is in maven under the compile options add: > {code} > + -Xmax-classfile-name > + 128 > {code} > In SBT add: > {code} > +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"), > {code}
[jira] [Created] (SPARK-4820) Spark build encounters "File name too long" on some encrypted filesystems
Patrick Wendell created SPARK-4820: -- Summary: Spark build encounters "File name too long" on some encrypted filesystems Key: SPARK-4820 URL: https://issues.apache.org/jira/browse/SPARK-4820 Project: Spark Issue Type: Bug Reporter: Patrick Wendell This was reported by Luchesar Cekov on github along with a proposed fix. The fix has some potential downstream issues (it will modify the classnames) so until we understand better how many users are affected we aren't going to merge it. However, I'd like to include the issue and workaround here. The issue produces this error: {code} [error] == Expanded type of tree == [error] [error] ConstantType(value = Constant(Throwable)) [error] [error] uncaught exception during compilation: java.io.IOException [error] File name too long [error] two errors found {code} The workaround is in maven under the compile options add: {code} + -Xmax-classfile-name + 128 {code} In SBT add: {code} +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"), {code}
[jira] [Created] (SPARK-4819) Remove Guava's "Optional" from public API
Marcelo Vanzin created SPARK-4819: - Summary: Remove Guava's "Optional" from public API Key: SPARK-4819 URL: https://issues.apache.org/jira/browse/SPARK-4819 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.2.0 Reporter: Marcelo Vanzin Filing this mostly so this isn't forgotten. Spark currently exposes Guava types in its public API (the {{Optional}} class is used in the Java bindings). This makes it hard to properly hide Guava from user applications, and makes mixing different Guava versions with Spark a little sketchy (even if things should work, since those classes are pretty simple in general). Since this changes the public API, it has to be done in a release that allows such breakages. But it would be nice to at least have a transition plan for deprecating the affected APIs.
[jira] [Updated] (SPARK-4793) way to find assembly jar is too strict
[ https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4793: - Target Version/s: 1.3.0, 1.1.2, 1.2.1 Fix Version/s: 1.3.0 > way to find assembly jar is too strict > -- > > Key: SPARK-4793 > URL: https://issues.apache.org/jira/browse/SPARK-4793 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.1.0 >Reporter: Adrian Wang >Assignee: Adrian Wang >Priority: Minor > Fix For: 1.3.0
[jira] [Updated] (SPARK-4793) way to find assembly jar is too strict
[ https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4793: - Affects Version/s: 1.1.0
[jira] [Updated] (SPARK-4793) way to find assembly jar is too strict
[ https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4793: - Assignee: Adrian Wang
[jira] [Updated] (SPARK-4215) Allow requesting executors only on Yarn (for now)
[ https://issues.apache.org/jira/browse/SPARK-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4215: - Fix Version/s: 1.3.0 > Allow requesting executors only on Yarn (for now) > - > > Key: SPARK-4215 > URL: https://issues.apache.org/jira/browse/SPARK-4215 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.2.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > Labels: backport-needed > Fix For: 1.3.0 > > > Currently if the user attempts to call `sc.requestExecutors` or enables > dynamic allocation on, say, standalone mode, it just fails silently. We must > at the very least log a warning to say it's only available for Yarn > currently, or maybe even throw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4215) Allow requesting executors only on Yarn (for now)
[ https://issues.apache.org/jira/browse/SPARK-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4215: - Labels: backport-needed (was: ) > Allow requesting executors only on Yarn (for now) > - > > Key: SPARK-4215 > URL: https://issues.apache.org/jira/browse/SPARK-4215 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.2.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > Labels: backport-needed > Fix For: 1.3.0 > > > Currently if the user attempts to call `sc.requestExecutors` or enables > dynamic allocation on, say, standalone mode, it just fails silently. We must > at the very least log a warning to say it's only available for Yarn > currently, or maybe even throw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
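A minimal sketch of the guard the ticket proposes, in Python for brevity (the real fix would live in Scala inside Spark core; the function and its signature are hypothetical): warn by default, or fail fast in strict mode, when executor requests are made against a non-YARN master:

```python
import warnings

def check_dynamic_allocation(master, strict=False):
    """Hypothetical guard sketching the proposed fix: dynamic allocation
    and sc.requestExecutors are only supported on YARN for now, so any
    other cluster manager should at least log a warning, or raise when
    strict=True, instead of failing silently."""
    if master.startswith("yarn"):
        return True
    msg = ("Requesting executors is only supported on YARN for now; "
           "ignoring request on master %r" % master)
    if strict:
        raise ValueError(msg)
    warnings.warn(msg)
    return False
```

Either behavior is an improvement over silence; the warning preserves backward compatibility while the exception surfaces misconfiguration immediately.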
[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241701#comment-14241701 ] Pat Ferrel commented on SPARK-2075: --- If the explanation is correct, this needs to be filed against Spark for publishing wrong or incomplete artifacts to the Maven repos. There would need to be a different artifact for every config option that changes internal naming. I can't understand why lots of people aren't running into this; all it takes is linking against the repo artifact and running against a user-compiled Spark. > Anonymous classes are missing from Spark distribution > - > > Key: SPARK-2075 > URL: https://issues.apache.org/jira/browse/SPARK-2075 > Project: Spark > Issue Type: Bug > Components: Build, Spark Core >Affects Versions: 1.0.0 >Reporter: Paul R. Brown >Priority: Critical > > Running a job built against the Maven dep for 1.0.0 and the hadoop1 > distribution produces: > {code} > java.lang.ClassNotFoundException: > org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 > {code} > Here's what's in the Maven dep as of 1.0.0: > {code} > jar tvf > ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar > | grep 'rdd/RDD' | grep 'saveAs' > 1519 Mon May 26 13:57:58 PDT 2014 > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class > 1560 Mon May 26 13:57:58 PDT 2014 > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class > {code} > And here's what's in the hadoop1 distribution: > {code} > jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' > {code} > I.e., it's not there. 
It is in the hadoop2 distribution: > {code} > jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' > 1519 Mon May 26 07:29:54 PDT 2014 > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class > 1560 Mon May 26 07:29:54 PDT 2014 > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
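The `jar tvf | grep` checks quoted above can be reproduced programmatically. Here is a small, hedged Python equivalent using the standard zipfile module (a jar is just a zip archive), handy for diffing a Maven artifact against an assembly; the function name is my own:

```python
import zipfile

def classes_matching(jar_path, *substrings):
    """List .class entries in a jar whose path contains all the given
    substrings -- a Python stand-in for `jar tvf ... | grep ... | grep ...`,
    useful for checking whether a class made it into an assembly."""
    with zipfile.ZipFile(jar_path) as jar:
        return [name for name in jar.namelist()
                if name.endswith(".class")
                and all(s in name for s in substrings)]
```

An empty result for the hadoop1 assembly, against a non-empty result for the Maven artifact, would confirm the report mechanically.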
[jira] [Closed] (SPARK-4771) Document standalone --supervise feature
[ https://issues.apache.org/jira/browse/SPARK-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4771. Resolution: Fixed Fix Version/s: 1.2.1 1.1.2 > Document standalone --supervise feature > --- > > Key: SPARK-4771 > URL: https://issues.apache.org/jira/browse/SPARK-4771 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 1.3.0, 1.1.2, 1.2.1 > > > We need this especially for streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4771) Document standalone --supervise feature
[ https://issues.apache.org/jira/browse/SPARK-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4771: - Fix Version/s: 1.3.0 > Document standalone --supervise feature > --- > > Key: SPARK-4771 > URL: https://issues.apache.org/jira/browse/SPARK-4771 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 1.3.0 > > > We need this especially for streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4771) Document standalone --supervise feature
[ https://issues.apache.org/jira/browse/SPARK-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4771: - Target Version/s: 1.3.0, 1.1.2, 1.2.1 (was: 1.1.2, 1.2.1) > Document standalone --supervise feature > --- > > Key: SPARK-4771 > URL: https://issues.apache.org/jira/browse/SPARK-4771 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 1.3.0 > > > We need this especially for streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4329) Add indexing feature for HistoryPage
[ https://issues.apache.org/jira/browse/SPARK-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4329. Resolution: Fixed Assignee: Kousuke Saruta > Add indexing feature for HistoryPage > > > Key: SPARK-4329 > URL: https://issues.apache.org/jira/browse/SPARK-4329 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta > Fix For: 1.3.0 > > > The current HistoryPage has links only to the previous and next pages. > I suggest adding an index so that history pages can be accessed directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
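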
[jira] [Updated] (SPARK-4329) Add indexing feature for HistoryPage
[ https://issues.apache.org/jira/browse/SPARK-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4329: - Fix Version/s: 1.3.0 > Add indexing feature for HistoryPage > > > Key: SPARK-4329 > URL: https://issues.apache.org/jira/browse/SPARK-4329 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta > Fix For: 1.3.0 > > > The current HistoryPage has links only to the previous and next pages. > I suggest adding an index so that history pages can be accessed directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4161) Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4161: - Labels: backport-needed (was: ) > Spark shell class path is not correctly set if "spark.driver.extraClassPath" > is set in defaults.conf > > > Key: SPARK-4161 > URL: https://issues.apache.org/jira/browse/SPARK-4161 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.1.0, 1.2.0 > Environment: Mac, Ubuntu >Reporter: Shay Seng >Assignee: Guoqiang Li > Labels: backport-needed > Fix For: 1.3.0 > > > (1) I want to launch a spark-shell + with jars that are only required by the > driver (ie. not shipped to slaves) > > (2) I added "spark.driver.extraClassPath /mypath/to.jar" to my > spark-defaults.conf > I launched spark-shell with: ./spark-shell > Here I see on the WebUI that spark.driver.extraClassPath has been set, but I > am NOT able to access any methods in the jar. > (3) I removed "spark.driver.extraClassPath" from my spark-default.conf > I launched spark-shell with ./spark-shell --driver.class.path /mypath/to.jar > Again I see that the WebUI spark.driver.extraClassPath has been set. > But this time I am able to access the methods in the jar. > Looks like when the driver class path is loaded from spark-default.conf, the > REPL's classpath is not correctly appended. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4161) Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4161: - Fix Version/s: 1.3.0 > Spark shell class path is not correctly set if "spark.driver.extraClassPath" > is set in defaults.conf > > > Key: SPARK-4161 > URL: https://issues.apache.org/jira/browse/SPARK-4161 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.1.0, 1.2.0 > Environment: Mac, Ubuntu >Reporter: Shay Seng >Assignee: Guoqiang Li > Labels: backport-needed > Fix For: 1.3.0 > > > (1) I want to launch a spark-shell + with jars that are only required by the > driver (ie. not shipped to slaves) > > (2) I added "spark.driver.extraClassPath /mypath/to.jar" to my > spark-defaults.conf > I launched spark-shell with: ./spark-shell > Here I see on the WebUI that spark.driver.extraClassPath has been set, but I > am NOT able to access any methods in the jar. > (3) I removed "spark.driver.extraClassPath" from my spark-default.conf > I launched spark-shell with ./spark-shell --driver.class.path /mypath/to.jar > Again I see that the WebUI spark.driver.extraClassPath has been set. > But this time I am able to access the methods in the jar. > Looks like when the driver class path is loaded from spark-default.conf, the > REPL's classpath is not correctly appended. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4161) Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4161: - Target Version/s: 1.3.0, 1.1.2, 1.2.1 (was: 1.1.2, 1.2.1) > Spark shell class path is not correctly set if "spark.driver.extraClassPath" > is set in defaults.conf > > > Key: SPARK-4161 > URL: https://issues.apache.org/jira/browse/SPARK-4161 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.1.0, 1.2.0 > Environment: Mac, Ubuntu >Reporter: Shay Seng >Assignee: Guoqiang Li > Labels: backport-needed > Fix For: 1.3.0 > > > (1) I want to launch a spark-shell + with jars that are only required by the > driver (ie. not shipped to slaves) > > (2) I added "spark.driver.extraClassPath /mypath/to.jar" to my > spark-defaults.conf > I launched spark-shell with: ./spark-shell > Here I see on the WebUI that spark.driver.extraClassPath has been set, but I > am NOT able to access any methods in the jar. > (3) I removed "spark.driver.extraClassPath" from my spark-default.conf > I launched spark-shell with ./spark-shell --driver.class.path /mypath/to.jar > Again I see that the WebUI spark.driver.extraClassPath has been set. > But this time I am able to access the methods in the jar. > Looks like when the driver class path is loaded from spark-default.conf, the > REPL's classpath is not correctly appended. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
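For illustration, here is a minimal parser for spark-defaults.conf-style properties (whitespace-separated key and value, '#' starting a comment). This is an assumption-laden sketch, not Spark's actual loader; the underlying bug is about timing, since a spark.driver.extraClassPath value read this way after the driver JVM has started is too late to extend the REPL's class path:

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf-style text into a dict.

    Sketch only: each non-comment line is split into a key and the rest
    of the line as its value, mirroring the one-property-per-line format
    shown in the reproduction steps above."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # key, then remainder as the value
        if len(parts) == 2:
            props[parts[0]] = parts[1].strip()
    return props
```

Reading the property out of the file is the easy part; the report shows the value surfacing in the WebUI yet not reaching the REPL's class loader, which is why the command-line flag behaves differently.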
[jira] [Commented] (SPARK-2951) SerDeUtils.pythonToPairRDD fails on RDDs of pickled array.arrays in Python 2.6
[ https://issues.apache.org/jira/browse/SPARK-2951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241654#comment-14241654 ] Apache Spark commented on SPARK-2951: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3668 > SerDeUtils.pythonToPairRDD fails on RDDs of pickled array.arrays in Python 2.6 > -- > > Key: SPARK-2951 > URL: https://issues.apache.org/jira/browse/SPARK-2951 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.1.0 >Reporter: Josh Rosen >Assignee: Davies Liu > Fix For: 1.2.0 > > > With Python 2.6, calling SerDeUtils.pythonToPairRDD() on an RDD of pickled > Python array.arrays will fail with this exception: > {code} > ava.lang.ClassCastException: java.lang.String cannot be cast to > java.util.ArrayList > > net.razorvine.pickle.objects.ArrayConstructor.construct(ArrayConstructor.java:33) > net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617) > net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170) > net.razorvine.pickle.Unpickler.load(Unpickler.java:84) > net.razorvine.pickle.Unpickler.loads(Unpickler.java:97) > > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToPairRDD$1$$anonfun$5.apply(SerDeUtil.scala:106) > > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToPairRDD$1$$anonfun$5.apply(SerDeUtil.scala:106) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:898) > > org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:880) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) > org.apache.spark.scheduler.Task.run(Task.scala:54) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {code} > I think this is due to a difference in how array.array is pickled in Python > 2.6 vs. Python 2.7. To see this, run the following script: > {code} > from pickletools import dis, optimize > from pickle import dumps, loads, HIGHEST_PROTOCOL > from array import array > arr = array('d', [1, 2, 3]) > #protocol = HIGHEST_PROTOCOL > protocol = 0 > pickled = dumps(arr, protocol=protocol) > pickled = optimize(pickled) > unpickled = loads(pickled) > print arr > print unpickled > print dis(pickled) > {code} > In Python 2.7, this outputs > {code} > array('d', [1.0, 2.0, 3.0]) > array('d', [1.0, 2.0, 3.0]) > 0: cGLOBAL 'array array' >13: (MARK >14: SSTRING 'd' >19: (MARK >20: lLIST (MARK at 19) >21: FFLOAT 1.0 >26: aAPPEND >27: FFLOAT 2.0 >32: aAPPEND >33: FFLOAT 3.0 >38: aAPPEND >39: tTUPLE (MARK at 13) >40: RREDUCE >41: .STOP > highest protocol among opcodes = 0 > None > {code} > whereas 2.6 outputs > {code} > array('d', [1.0, 2.0, 3.0]) > array('d', [1.0, 2.0, 3.0]) > 0: cGLOBAL 'array array' >13: (MARK >14: SSTRING 'd' >19: SSTRING > '\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@' > 110: tTUPLE (MARK at 13) > 111: RREDUCE > 112: .STOP > highest protocol among opcodes = 0 > None > {code} > I think the Java-side depickling library doesn't expect this pickled format, > causing this failure. > I noticed this when running PySpark's unit tests on 2.6 because the > TestOuputFormat.test_newhadoop test failed. > I think that this issue affects all of the methods that might need to > depickle arrays in Java, including all of the Hadoop output format methods. > How should we try to fix this? Require that users upgrade to 2.7 if they > want to use code that requires this? Open a bug with the depickling library > maintainers? 
Try to hack in our own pickling routines for arrays if we > detect that we're using 2.6? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
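To make the 2.6 behavior concrete: the second STRING opcode in the 2.6 dump is the array's raw machine-format bytes, and Python itself can rebuild the array from them. A small sketch (assuming native byte order, which is what both the 2.6 pickler and the dump above use):

```python
import struct
from array import array

def decode_machine_array(typecode, raw):
    """Rebuild an array.array from the typecode plus raw machine-format
    bytes -- the exact two-argument form Python 2.6's pickler emits and
    the Java-side unpickler above chokes on. Sketch only: assumes the
    bytes were produced on a machine with the same byte order."""
    arr = array(typecode)
    arr.frombytes(raw)
    return list(arr)
```

The Java-side fix would need an equivalent branch: when the second constructor argument is a string rather than a list, unpack it as packed machine-format values.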
[jira] [Commented] (SPARK-3918) Forget Unpersist in RandomForest.scala(train Method)
[ https://issues.apache.org/jira/browse/SPARK-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241649#comment-14241649 ] Joseph K. Bradley commented on SPARK-3918: -- Oops! I forgot to update that PR's name. It was originally in that PR, but [~Junlong Liu] sent a PR with the change first: [https://github.com/apache/spark/commit/942847fd94c920f7954ddf01f97263926e512b0e] (The PR linked above was not tagged with this JIRA.) > Forget Unpersist in RandomForest.scala(train Method) > > > Key: SPARK-3918 > URL: https://issues.apache.org/jira/browse/SPARK-3918 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 > Environment: All >Reporter: junlong >Assignee: Joseph K. Bradley > Labels: decisiontree, train, unpersist > Fix For: 1.1.0 > > Original Estimate: 10m > Remaining Estimate: 10m > >In version 1.1.0, the train method in DecisionTree.scala persists treeInput > in memory but never unpersists it, which caused heavy disk usage. >In the GitHub version (1.2.0, maybe), the train method in RandomForest.scala > likewise persists baggedInput without unpersisting it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
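The fix pattern is simply to pair each persist with an unpersist over the scope that uses the cached data. A hypothetical Python sketch of that pattern (the helper is my own invention, tested here against a stub rather than a real RDD):

```python
from contextlib import contextmanager

@contextmanager
def persisted(rdd):
    """Pair persist() with a guaranteed unpersist(), so cached training
    data (treeInput / baggedInput in the report) is released even if
    training raises, instead of staying pinned after train() returns."""
    rdd.persist()
    try:
        yield rdd
    finally:
        rdd.unpersist()
```

In the Scala code the equivalent is an explicit unpersist() call at the end of train(); the context-manager form just makes the pairing impossible to forget.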
[jira] [Updated] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3702: - Description: Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. This is a super-task of several sub-tasks (but JIRA does not allow subtasks of subtasks). See the "requires" links below for subtasks. Goals: * give intuitive structure to API, both for developers and for generated documentation * support meta-algorithms (e.g., boosting) * support generic functionality (e.g., evaluation) * reduce code duplication across classes [Design doc for class hierarchy | https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/] was: Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. This is a super-task of several sub-tasks (but JIRA does not allow subtasks of subtasks). See the "depends on" links below for subtasks. Goals: * give intuitive structure to API, both for developers and for generated documentation * support meta-algorithms (e.g., boosting) * support generic functionality (e.g., evaluation) * reduce code duplication across classes [Design doc for class hierarchy | https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/] > Standardize MLlib classes for learners, models > -- > > Key: SPARK-3702 > URL: https://issues.apache.org/jira/browse/SPARK-3702 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > > Summary: Create a class hierarchy for learning algorithms and the models > those algorithms produce. > This is a super-task of several sub-tasks (but JIRA does not allow subtasks > of subtasks). See the "requires" links below for subtasks. 
> Goals: > * give intuitive structure to API, both for developers and for generated > documentation > * support meta-algorithms (e.g., boosting) > * support generic functionality (e.g., evaluation) > * reduce code duplication across classes > [Design doc for class hierarchy | > https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241611#comment-14241611 ] Joseph K. Bradley commented on SPARK-3702: -- APIs for Classifiers, Regressors > Standardize MLlib classes for learners, models > -- > > Key: SPARK-3702 > URL: https://issues.apache.org/jira/browse/SPARK-3702 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > > Summary: Create a class hierarchy for learning algorithms and the models > those algorithms produce. > This is a super-task of several sub-tasks (but JIRA does not allow subtasks > of subtasks). See the "depends on" links below for subtasks. > Goals: > * give intuitive structure to API, both for developers and for generated > documentation > * support meta-algorithms (e.g., boosting) > * support generic functionality (e.g., evaluation) > * reduce code duplication across classes > [Design doc for class hierarchy | > https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3702: - Description: Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. This is a super-task of several sub-tasks (but JIRA does not allow subtasks of subtasks). See the "depends on" links below for subtasks. Goals: * give intuitive structure to API, both for developers and for generated documentation * support meta-algorithms (e.g., boosting) * support generic functionality (e.g., evaluation) * reduce code duplication across classes [Design doc for class hierarchy | https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/] was: Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. Goals: * give intuitive structure to API, both for developers and for generated documentation * support meta-algorithms (e.g., boosting) * support generic functionality (e.g., evaluation) * reduce code duplication across classes [Design doc for class hierarchy | https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/] > Standardize MLlib classes for learners, models > -- > > Key: SPARK-3702 > URL: https://issues.apache.org/jira/browse/SPARK-3702 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > > Summary: Create a class hierarchy for learning algorithms and the models > those algorithms produce. > This is a super-task of several sub-tasks (but JIRA does not allow subtasks > of subtasks). See the "depends on" links below for subtasks. 
> Goals: > * give intuitive structure to API, both for developers and for generated > documentation > * support meta-algorithms (e.g., boosting) > * support generic functionality (e.g., evaluation) > * reduce code duplication across classes > [Design doc for class hierarchy | > https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4789) Standardize ML Prediction APIs
[ https://issues.apache.org/jira/browse/SPARK-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-4789: Assignee: Joseph K. Bradley > Standardize ML Prediction APIs > -- > > Key: SPARK-4789 > URL: https://issues.apache.org/jira/browse/SPARK-4789 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > Create a standard set of abstractions for prediction in spark.ml. This will > follow the design doc specified in [SPARK-3702]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4789) Standardize ML Prediction APIs
[ https://issues.apache.org/jira/browse/SPARK-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4789: - Issue Type: Sub-task (was: New Feature) Parent: SPARK-1856 > Standardize ML Prediction APIs > -- > > Key: SPARK-4789 > URL: https://issues.apache.org/jira/browse/SPARK-4789 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Create a standard set of abstractions for prediction in spark.ml. This will > follow the design doc specified in [SPARK-3702]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241535#comment-14241535 ] Debasish Das commented on SPARK-4675: - There are a few issues: 1. A batch API for topK similar users and topK similar products 2. A comparison of product x product similarities generated with columnSimilarities against the topK similar products I added batch APIs for topK product recommendations for each user and topK user recommendations for each product in SPARK-4231; a similar batch API will be very helpful for topK similar users and topK similar products. I agree on cosine similarity; you should be able to re-use the column similarity calculations. I think a better idea is to add rowMatrix.similarRows and re-use that code to generate both product similarities and user similarities. But my question is more about validation: we can compute product similarities on raw features, or on the product factor matrix; which one is better? > Find similar products and similar users in MatrixFactorizationModel > --- > > Key: SPARK-4675 > URL: https://issues.apache.org/jira/browse/SPARK-4675 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Steven Bourke >Priority: Trivial > Labels: mllib, recommender > > Using the latent feature space that is learnt in MatrixFactorizationModel, I > have added 2 new functions to find similar products and similar users. A user > of the API can for example pass a product ID, and get the closest products. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
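The similarity lookup under discussion reduces to cosine similarity over the latent factor vectors. A hedged, self-contained sketch on plain Python dicts and lists (not the actual MatrixFactorizationModel API; the function name and input shape are assumptions):

```python
import math

def top_k_similar(target, factors, k=2):
    """Return the k ids most similar to `target` by cosine similarity
    of latent factor vectors. `factors` maps id -> list of floats,
    standing in for the model's productFeatures / userFeatures."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)
    target_vec = factors[target]
    scores = [(pid, cosine(target_vec, vec))
              for pid, vec in factors.items() if pid != target]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]
```

The validation question raised in the comment is orthogonal to this computation: the same routine applies whether `factors` holds raw feature rows or learned factor rows, which is exactly why comparing the two outputs is the interesting experiment.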
[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241524#comment-14241524 ] Apache Spark commented on SPARK-4740: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/3667 > Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey > > > Key: SPARK-4740 > URL: https://issues.apache.org/jira/browse/SPARK-4740 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 1.2.0 >Reporter: Zhang, Liye >Assignee: Reynold Xin > Attachments: (rxin patch better executor)TestRunner sort-by-key - > Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner > sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report > 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner > sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, > TestRunner sort-by-key - Thread dump for executor 1_files (Nio-48 cores per > node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z > > > When testing current spark master (1.3.0-snapshot) with spark-perf > (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService > takes much longer time than NIO based shuffle transferService. The network > throughput of Netty is only about half of that of NIO. > We tested with standalone mode, and the data set we used for test is 20 > billion records, and the total size is about 400GB. Spark-perf test is > Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each > executor memory is 64GB. The reduce tasks number is set to 1000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241523#comment-14241523 ] Reynold Xin commented on SPARK-4740: Also [~jerryshao] when I asked you to disable transferTo, the code you pasted still return a ByteBuffer, which wouldn't work in Netty. Was the code you pasted here different from what was compiled? Can you try this patch, which disables transferTo? https://github.com/apache/spark/pull/3667 > Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey > > > Key: SPARK-4740 > URL: https://issues.apache.org/jira/browse/SPARK-4740 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 1.2.0 >Reporter: Zhang, Liye >Assignee: Reynold Xin > Attachments: (rxin patch better executor)TestRunner sort-by-key - > Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner > sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report > 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner > sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, > TestRunner sort-by-key - Thread dump for executor 1_files (Nio-48 cores per > node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z > > > When testing current spark master (1.3.0-snapshot) with spark-perf > (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService > takes much longer time than NIO based shuffle transferService. The network > throughput of Netty is only about half of that of NIO. > We tested with standalone mode, and the data set we used for test is 20 > billion records, and the total size is about 400GB. Spark-perf test is > Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each > executor memory is 64GB. The reduce tasks number is set to 1000. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241463#comment-14241463 ] Aaron Davidson commented on SPARK-4740: --- Clarification: The merged version of Reynold's patch has connectionsPerPeer set to 1, since we could not demonstrate a significant improvement with other values. In your test with HDDs, did you have it set, or was it using the default value?
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4569) Rename "externalSorting" in Aggregator
[ https://issues.apache.org/jira/browse/SPARK-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241455#comment-14241455 ] Apache Spark commented on SPARK-4569: - User 'ilganeli' has created a pull request for this issue: https://github.com/apache/spark/pull/3666 > Rename "externalSorting" in Aggregator > -- > > Key: SPARK-4569 > URL: https://issues.apache.org/jira/browse/SPARK-4569 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.2.0 >Reporter: Sandy Ryza >Priority: Trivial > > While technically all spilling in Spark does result in sorting, calling this > variable externalSorting makes it seem like ExternalSorter will be used, when > in fact it just means whether spilling is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1037) the name of findTaskFromList & findTask in TaskSetManager.scala is confusing
[ https://issues.apache.org/jira/browse/SPARK-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241442#comment-14241442 ] Apache Spark commented on SPARK-1037: - User 'ilganeli' has created a pull request for this issue: https://github.com/apache/spark/pull/3665 > the name of findTaskFromList & findTask in TaskSetManager.scala is confusing > > > Key: SPARK-1037 > URL: https://issues.apache.org/jira/browse/SPARK-1037 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 0.9.1, 1.0.0 >Reporter: Nan Zhu >Priority: Minor > Labels: starter > > the name of these two functions is confusing > though in the comments the author said that the method does "dequeue" tasks > from the list but from the name, it is not explicitly indicating that the > method will mutate the parameter > in > private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = { > while (!list.isEmpty) { > val index = list.last > list.trimEnd(1) > if (copiesRunning(index) == 0 && !successful(index)) { > return Some(index) > } > } > None > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
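The findTaskFromList snippet quoted above pops candidates off the tail of the list as it scans, so the caller's collection is consumed as a side effect that the method name does not advertise. A minimal Java sketch of the same pattern (the class name and array-based eligibility checks are illustrative, not Spark's actual API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class DequeueSearch {
    // Mirrors findTaskFromList: scans from the tail, removing each
    // candidate it inspects, and returns the first eligible index.
    // The caller's list shrinks as a side effect, which the name
    // "find..." does not suggest.
    static Optional<Integer> findTaskFromList(List<Integer> list,
                                              int[] copiesRunning,
                                              boolean[] successful) {
        while (!list.isEmpty()) {
            int index = list.remove(list.size() - 1); // "list.last" + "trimEnd(1)"
            if (copiesRunning[index] == 0 && !successful[index]) {
                return Optional.of(index);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Integer> pending = new ArrayList<>(List.of(0, 1, 2));
        int[] copiesRunning = {0, 1, 0};          // task 1 is already running
        boolean[] successful = {false, false, false};

        Optional<Integer> found = findTaskFromList(pending, copiesRunning, successful);
        System.out.println(found.get());    // 2 (first eligible index from the tail)
        System.out.println(pending.size()); // 2: one candidate was consumed from the list
    }
}
```

A name like dequeueTaskFromList would make the mutation explicit, which is essentially what the issue asks for.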
[jira] [Commented] (SPARK-3607) ConnectionManager threads.max configs on the thread pools don't work
[ https://issues.apache.org/jira/browse/SPARK-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241431#comment-14241431 ] Apache Spark commented on SPARK-3607: - User 'ilganeli' has created a pull request for this issue: https://github.com/apache/spark/pull/3664 > ConnectionManager threads.max configs on the thread pools don't work > > > Key: SPARK-3607 > URL: https://issues.apache.org/jira/browse/SPARK-3607 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Thomas Graves >Priority: Minor > > In the ConnectionManager we have a bunch of thread pools. They have settings > for the maximum number of threads for each Threadpool (like > spark.core.connection.handler.threads.max). > Those configs don't work because it's using an unbounded queue. From the > ThreadPoolExecutor javadoc page: no more than corePoolSize threads will ever > be created. (And the value of the maximumPoolSize therefore doesn't have any > effect.) > Luckily this doesn't matter too much, as you can work around it by just > increasing the minimum, like spark.core.connection.handler.threads.min. > These configs aren't documented either, so it's more of an internal thing when > someone is reading the code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
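The javadoc behavior the issue quotes is easy to demonstrate with java.util.concurrent directly: when the work queue is unbounded, a ThreadPoolExecutor never creates more than corePoolSize threads, so maximumPoolSize is effectively ignored. A self-contained sketch (not Spark's ConnectionManager code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class UnboundedQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        // corePoolSize = 2, maximumPoolSize = 8, but the queue is unbounded.
        // Extra threads (up to maximumPoolSize) are only created when the
        // queue rejects a task, which never happens for a LinkedBlockingQueue
        // constructed without a capacity bound.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 8, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

        for (int i = 0; i < 10; i++) {
            pool.execute(() -> {
                try { release.await(); } catch (InterruptedException e) { }
            });
        }
        // Two tasks occupy the two core threads; the other eight just queue.
        System.out.println(pool.getPoolSize());      // 2
        System.out.println(pool.getQueue().size());  // 8

        release.countDown();
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

This is why raising the `*.threads.max` settings has no effect while raising `*.threads.min` (the core size) does.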
[jira] [Commented] (SPARK-4746) integration tests should be separated from faster unit tests
[ https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241383#comment-14241383 ] Imran Rashid commented on SPARK-4746: - I think this can be done in two ways: (1) by moving all the integration tests into a separate folder (e.g. core/src/integ-test). I don't actually know how to make that work in sbt & maven but I'm hoping it won't be too complicated. Or (2) we use scalatest's test tags for integration tests, so they can be skipped. http://scalatest.org/user_guide/tagging_your_tests Test tags have the advantage that you can have multiple tags, and you can group them, etc., so you can get more flexibility in what you choose to run. This could be useful later, e.g. we might want to tag performance tests as separate from correctness tests, etc. We don't need to do all that now but this would open the door for it at least. I have some experience with test tags. If we want to use that approach, I can assign to myself and work on a PR. > integration tests should be separated from faster unit tests > > > Key: SPARK-4746 > URL: https://issues.apache.org/jira/browse/SPARK-4746 > Project: Spark > Issue Type: Bug >Reporter: Imran Rashid >Priority: Trivial > > Currently there isn't a good way for a developer to skip the longer > integration tests. This can slow down local development. See > http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html > One option is to use scalatest's notion of test tags to tag all integration > tests, so they could easily be skipped -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4818) Join operation should use iterator/lazy evaluation
Johannes Simon created SPARK-4818: - Summary: Join operation should use iterator/lazy evaluation Key: SPARK-4818 URL: https://issues.apache.org/jira/browse/SPARK-4818 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.1 Reporter: Johannes Simon Priority: Minor

The current implementation of the join operation does not use an iterator (i.e. lazy evaluation), causing it to explicitly evaluate the co-grouped values. In big data applications, these value collections can be very large. This causes the *cartesian product of all co-grouped values* for a specific key of both RDDs to be kept in memory during the flatMapValues operation, resulting in an *O(size(pair._1)*size(pair._2))* memory consumption instead of *O(1)*. Very large value collections will therefore cause "GC overhead limit exceeded" exceptions and fail the task, or at least slow down execution dramatically.

{code:title=PairRDDFunctions.scala|borderStyle=solid}
//...
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1; w <- pair._2) yield (v, w)
  )
}
//...
{code}

Since cogroup returns an Iterable backed by an Array, the join implementation could be changed to the following, which uses lazy evaluation instead, and has almost no memory overhead:

{code:title=PairRDDFunctions.scala|borderStyle=solid}
//...
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}
//...
{code}

Alternatively, if the current implementation is intentionally not using lazy evaluation for some reason, there could be a *lazyJoin()* method next to the original join implementation that utilizes lazy evaluation. This of course applies to other join operations as well. Thanks! :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
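The memory difference SPARK-4818 describes can be sketched outside Spark: a strict cartesian product materializes all size(A)*size(B) pairs at once, while an iterator-based one yields pairs on demand with O(1) extra memory. A Java illustration of the two shapes (the class and method names are made up for this sketch; it is not PairRDDFunctions):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map.Entry;

public class LazyJoinSketch {
    // Strict: builds the whole cartesian product up front --
    // O(a.size() * b.size()) memory, the behavior the issue complains about.
    static <V, W> List<Entry<V, W>> strictProduct(List<V> a, List<W> b) {
        List<Entry<V, W>> out = new ArrayList<>();
        for (V v : a)
            for (W w : b)
                out.add(new SimpleEntry<>(v, w));
        return out;
    }

    // Lazy: pairs are created one at a time as the consumer advances,
    // analogous to iterating pair._1.iterator x pair._2.iterator.
    static <V, W> Iterator<Entry<V, W>> lazyProduct(List<V> a, List<W> b) {
        return new Iterator<>() {
            int i = 0, j = 0;
            public boolean hasNext() { return i < a.size() && j < b.size(); }
            public Entry<V, W> next() {
                Entry<V, W> e = new SimpleEntry<>(a.get(i), b.get(j));
                if (++j == b.size()) { j = 0; i++; } // advance row-major
                return e;
            }
        };
    }

    public static void main(String[] args) {
        List<Integer> a = List.of(1, 2);
        List<String> b = List.of("x", "y", "z");
        System.out.println(strictProduct(a, b).size()); // 6 pairs held at once
        Iterator<Entry<Integer, String>> it = lazyProduct(a, b);
        int n = 0;
        while (it.hasNext()) { it.next(); n++; }
        System.out.println(n); // still 6 pairs, but only one alive at a time
    }
}
```

Both shapes produce the same pairs; only the peak memory differs, which is the whole point of the proposed change.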
[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241362#comment-14241362 ] Mark Fisher commented on SPARK-2892: [~srowen] SPARK-4802 is only related to the receiverInfo not being removed. This issue is actually much more critical, given that Receivers do not seem to stop other than in local mode. Please reopen. > Socket Receiver does not stop when streaming context is stopped > --- > > Key: SPARK-2892 > URL: https://issues.apache.org/jira/browse/SPARK-2892 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.2 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Running NetworkWordCount with > {quote} > ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); > Thread.sleep(6) > {quote} > gives the following error > {quote} > 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) > in 10047 ms on localhost (1/1) > 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at > ReceiverTracker.scala:275) finished in 10.056 s > 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at > ReceiverTracker.scala:275, took 10.179263 s > 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been > terminated > 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not > deregistered, Map(0 -> > ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) > 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped > 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately > 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after > time 1407375433000 > 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator > 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler > 14/08/06 18:37:13 INFO StreamingContext: StreamingContext 
stopped successfully > 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving > 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4746) integration tests should be separated from faster unit tests
[ https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-4746: Summary: integration tests should be separated from faster unit tests (was: integration tests should be seseparated from faster unit tests) > integration tests should be separated from faster unit tests > > > Key: SPARK-4746 > URL: https://issues.apache.org/jira/browse/SPARK-4746 > Project: Spark > Issue Type: Bug >Reporter: Imran Rashid >Priority: Trivial > > Currently there isn't a good way for a developer to skip the longer > integration tests. This can slow down local development. See > http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html > One option is to use scalatest's notion of test tags to tag all integration > tests, so they could easily be skipped -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1338) Create Additional Style Rules
[ https://issues.apache.org/jira/browse/SPARK-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241297#comment-14241297 ] Sean Owen commented on SPARK-1338: -- The PR for this was abandoned. What's the thinking on these style-rule JIRAs? There are a number still open. > Create Additional Style Rules > - > > Key: SPARK-1338 > URL: https://issues.apache.org/jira/browse/SPARK-1338 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Patrick Wendell >Assignee: Prashant Sharma > Fix For: 1.2.0 > > > There are a few other rules that would be helpful to have. Also we should add > tests for these rules because it's easy to get them wrong. I gave some > example comparisons from a JavaScript style checker. > Require spaces in type declarations: > def foo:String = X // no > def foo: String = XXX > def x:Int = 100 // no > val x: Int = 100 > Require spaces after keywords: > if(x - 3) // no > if (x + 10) > See: requireSpaceAfterKeywords from > https://github.com/mdevils/node-jscs > Disallow spaces inside of parentheses: > val x = ( 3 + 5 ) // no > val x = (3 + 5) > See: disallowSpacesInsideParentheses from > https://github.com/mdevils/node-jscs > Require spaces before and after binary operators: > See: requireSpaceBeforeBinaryOperators > See: disallowSpaceAfterBinaryOperators > from https://github.com/mdevils/node-jscs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
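As a rough illustration of what one of the proposed rules amounts to, the "require spaces after keywords" check can be approximated with a single regex applied to each source line. This is a toy sketch, not the actual Scalastyle rule or its configuration:

```java
import java.util.List;
import java.util.regex.Pattern;

public class StyleCheck {
    // Toy version of the rule described above: flag lines where a control
    // keyword (if/for/while) is immediately followed by '(' with no space.
    static final Pattern NO_SPACE_AFTER_KEYWORD =
            Pattern.compile("\\b(if|for|while)\\(");

    static boolean violates(String line) {
        return NO_SPACE_AFTER_KEYWORD.matcher(line).find();
    }

    public static void main(String[] args) {
        // The same examples given in the issue description.
        for (String line : List.of("if(x - 3)", "if (x + 10)")) {
            System.out.println(line + " -> " + (violates(line) ? "no" : "ok"));
        }
    }
}
```

A real rule needs a tokenizer rather than a regex (to skip strings and comments), which is part of why the issue asks for tests: these checks are easy to get wrong.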
[jira] [Resolved] (SPARK-1380) Add sort-merge based cogroup/joins.
[ https://issues.apache.org/jira/browse/SPARK-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1380. -- Resolution: Won't Fix The PR discussion suggests this is WontFix. > Add sort-merge based cogroup/joins. > --- > > Key: SPARK-1380 > URL: https://issues.apache.org/jira/browse/SPARK-1380 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Takuya Ueshin > > I've written cogroup/joins based on 'Sort-Merge' algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1385) Use existing code-path for JSON de/serialization of BlockId
[ https://issues.apache.org/jira/browse/SPARK-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1385. -- Resolution: Fixed PR is https://github.com/apache/spark/pull/289. This was merged in https://github.com/apache/spark/commit/de8eefa804e229635eaa29a78b9e9ce161ac58e1 > Use existing code-path for JSON de/serialization of BlockId > --- > > Key: SPARK-1385 > URL: https://issues.apache.org/jira/browse/SPARK-1385 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 0.9.0, 0.9.1 >Reporter: Andrew Or >Priority: Minor > Fix For: 1.0.0 > > > BlockId.scala already takes care of JSON de/serialization by parsing the > string to and from regex. This functionality is currently duplicated in > util/JsonProtocol.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1127) Add saveAsHBase to PairRDDFunctions
[ https://issues.apache.org/jira/browse/SPARK-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1127. -- Resolution: Won't Fix Fix Version/s: (was: 1.2.0) Given the discussion in both PRs, this looks like a WontFix, and consensus was it should proceed in a separate project. Questions about SchemaRDD and 1.2 sound like a new topic. > Add saveAsHBase to PairRDDFunctions > --- > > Key: SPARK-1127 > URL: https://issues.apache.org/jira/browse/SPARK-1127 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: haosdent huang >Assignee: haosdent huang > > Support to save data in HBase. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org