[jira] [Resolved] (SPARK-4791) Create SchemaRDD from case classes with multiple constructors

2014-12-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4791.
-
Resolution: Fixed
  Assignee: Joseph K. Bradley

> Create SchemaRDD from case classes with multiple constructors
> -
>
> Key: SPARK-4791
> URL: https://issues.apache.org/jira/browse/SPARK-4791
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Issue: One can usually take an RDD of case classes and create a SchemaRDD, 
> where Spark SQL infers the schema from the case class metadata.  However, if 
> the case class has multiple constructors, then ScalaReflection.schemaFor gets 
> confused.
> Motivation: In spark.ml, I would like to create a class with the following 
> signature:
> {code}
> case class LabeledPoint(label: Double, features: Vector, weight: Double) {
>   def this(label: Double, features: Vector) = this(label, features, 1.0)
> }
> {code}
> Proposed fix: Change ScalaReflection.schemaFor so that it checks whether there 
> are multiple constructors.  If there are, it should take the primary 
> constructor.  This will not change the behavior of existing code, since it 
> currently only supports case classes with one constructor.
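> As a rough illustration only (not the actual ScalaReflection patch), selecting 
> the primary constructor via Scala reflection could look like the sketch below; 
> {{primaryConstructorParams}} is an illustrative helper name:
> {code}
> import scala.reflect.runtime.universe._
>
> def primaryConstructorParams[T: TypeTag]: Seq[(String, Type)] = {
>   // among all constructors, keep only the primary one and ignore auxiliaries
>   val ctors = typeOf[T].declaration(nme.CONSTRUCTOR).asTerm.alternatives
>   val primary = ctors.collectFirst {
>     case c: MethodSymbol if c.isPrimaryConstructor => c
>   }.getOrElse(sys.error("no primary constructor found"))
>   primary.paramss.flatten.map(p => p.name.toString -> p.typeSignature)
> }
> {code}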



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4827) Max iterations (100) reached for batch Resolution with deeply nested projects and project *s

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242271#comment-14242271
 ] 

Apache Spark commented on SPARK-4827:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/3674

> Max iterations (100) reached for batch Resolution with deeply nested projects 
> and project *s
> 
>
> Key: SPARK-4827
> URL: https://issues.apache.org/jira/browse/SPARK-4827
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4827) Max iterations (100) reached for batch Resolution with deeply nested projects and project *s

2014-12-10 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-4827:
---

 Summary: Max iterations (100) reached for batch Resolution with 
deeply nested projects and project *s
 Key: SPARK-4827
 URL: https://issues.apache.org/jira/browse/SPARK-4827
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242177#comment-14242177
 ] 

宿荣全 edited comment on SPARK-4817 at 12/11/14 7:15 AM:
--

[~srowen]
One can always call {{foreachRDD}}, operate on the whole RDD, and then call 
{{take}} on the RDD to get a few elements to print. That achieves the effect, 
but it is more complicated.
{code:title=example NO.1:|borderStyle=solid}
val dstream = stream.map->filter->..foreachRDD(rdd => {
  val array = rdd.collect
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
{code}
{code:title=example NO.2:|borderStyle=solid}
val dstream = stream.map->filter->foreachRDD(rdd => {
  val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, 
String)]) => iter.toArray)
  val array = Array.concat(rddarray: _*)
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
{code}
Both samples achieve the effect, but from a design perspective having streaming 
code manipulate the RDD directly is not good design, and I think {{foreachRDD}} 
is rarely used directly in application code. 
Streaming actions are generally registered through the following 5 methods, 
all of which call {{foreachRDD}}:

# {color:blue}DStream.foreach{color}
# {color:blue}DStream.saveAsObjectFiles{color}
# {color:blue}DStream.saveAsTextFiles{color}
# {color:blue}PairDStreamFunctions.saveAsHadoopFiles{color}
# {color:blue}PairDStreamFunctions.saveAsNewAPIHadoopFiles{color}


was (Author: surq):
[~srowen]
Always call {{foreachRDD}}, and operate on all of the RDD, and then call 
{{take}} on the RDD to get a few elements to print. It can achieve the effect, 
but it is more complicated.
{code:title=for example|borderStyle=solid}:
1.val dstream = stream.map->filter->..foreachRDD(rdd => {
  val array = rdd.collect
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
2.val dstream = stream.map->filter->foreachRDD(rdd => {
  val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, 
String)]) => iter.toArray)
  val array = Array.concat(rddarray: _*)
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
{code}
This two samples can achieve the effect. From the design perspective streaming 
direct manipulation of the RDD is not a good design.and I think the method 
{{foreachRDD}}  is generally not used in coding. 
Generally when streaming register action by  through the following 6 methods. 
Those methods all called method {{foreachRDD}}.
{color:blue}
# DStream.foreach
# DStream.saveAsObjectFiles
# DStream.saveAsTextFiles
# PairDStreamFunctions.saveAsHadoopFiles
# PairDStreamFunctions.saveAsNewAPIHadoopFiles
{color}

> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> The Dstream.print function prints 10 elements but handles only 11 elements.
> A new function based on Dstream.print is proposed:
> print the specified number of elements while handling all of the elements in the RDD.
> An example work scenario:
> val dstream = stream.map->filter->mapPartitions->print
> The data remaining after the filter must update a database in mapPartitions, but there
> is no need to print every record; printing the top 20 is enough to view the data processing.
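> A minimal sketch of the idea, built on {{foreachRDD}} ({{printFirst}}, 
> {{dstream}} and {{num}} are illustrative names, not an existing API):
> {code}
> import org.apache.spark.streaming.dstream.DStream
>
> def printFirst[T](dstream: DStream[T], num: Int): Unit = {
>   dstream.foreachRDD { rdd =>
>     val total = rdd.count()          // forces every element to be processed
>     rdd.take(num).foreach(println)   // but prints only the first `num`
>     println(s"... ($total elements in this batch)")
>   }
> }
> {code}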



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4825) CTAS fails to resolve when created using saveAsTable

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242259#comment-14242259
 ] 

Apache Spark commented on SPARK-4825:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/3673

> CTAS fails to resolve when created using saveAsTable
> 
>
> Key: SPARK-4825
> URL: https://issues.apache.org/jira/browse/SPARK-4825
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Michael Armbrust
>Assignee: Cheng Hao
>Priority: Critical
>
> While writing a test for a different issue, I found that saveAsTable seems to 
> be broken:
> {code}
>   test("save join to table") {
> val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, 
> i.toString))
> sql("CREATE TABLE test1 (key INT, value STRING)")
> testData.insertInto("test1")
> sql("CREATE TABLE test2 (key INT, value STRING)")
> testData.insertInto("test2")
> testData.insertInto("test2")
> sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = 
> b.key").saveAsTable("test")
> checkAnswer(
>   table("test"),
>   sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = 
> b.key").collect().toSeq)
>   }
> ​
> sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = 
> b.key").saveAsTable("test")
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
> plan found, tree:
> 'CreateTableAsSelect None, test, false, None
>  Aggregate [], [COUNT(value#336) AS _c0#334L]
>   Join Inner, Some((key#335 = key#339))
>MetastoreRelation default, test1, Some(a)
>MetastoreRelation default, test2, Some(b)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4790) Flaky test in ReceivedBlockTrackerSuite: "block addition, block to batch allocation, and cleanup with write ahead log"

2014-12-10 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242250#comment-14242250
 ] 

Josh Rosen commented on SPARK-4790:
---

Actually, scratch that theory: here's a failure in the Master SBT build: 
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/1206/

> Flaky test in ReceivedBlockTrackerSuite: "block addition, block to batch 
> allocation, and cleanup with write ahead log"
> --
>
> Key: SPARK-4790
> URL: https://issues.apache.org/jira/browse/SPARK-4790
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Assignee: Tathagata Das
>  Labels: flaky-test
>
> Found another flaky streaming test, 
> "org.apache.spark.streaming.ReceivedBlockTrackerSuite.block addition, block 
> to batch allocation and cleanup with write ahead log":
> {code}
> Error Message
> File /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist.
> Stacktrace
> sbt.ForkMain$ForkError: File 
> /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist.
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:324)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogSuite$.getLogFilesInDirectory(WriteAheadLogSuite.scala:344)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite.getWriteAheadLogFiles(ReceivedBlockTrackerSuite.scala:248)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply$mcV$sp(ReceivedBlockTrackerSuite.scala:173)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite.org$scalatest$BeforeAndAfter$$super$runTest(ReceivedBlockTrackerSuite.scala:41)
>   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite.runTest(ReceivedBlockTrackerSuite.scala:41)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite.org$scalatest$BeforeAndAfter$$super$run(ReceivedBlockTrackerSuite.scala:41)
>   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
>   a

[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"

2014-12-10 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242248#comment-14242248
 ] 

Josh Rosen commented on SPARK-4826:
---

Actually, it turns out that this has happened twice.  Here's a link to another 
occurrence of the same failure pattern: 
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1147/

> Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: 
> "java.lang.IllegalStateException: File exists and there is no append support!"
> 
>
> Key: SPARK-4826
> URL: https://issues.apache.org/jira/browse/SPARK-4826
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Josh Rosen
>Assignee: Tathagata Das
>  Labels: flaky-test
>
> I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite 
> where four tests failed with the same exception.
> [Link to test result (this will eventually 
> break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/].
>   In case that link breaks:
> The failed tests:
> {code}
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in block manager, not in write ahead log
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in write ahead log, not in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in write ahead log, and test storing in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> with partially available in block manager, and rest in write ahead log
> {code}
> The error messages are all (essentially) the same:
> {code}
>  java.lang.IllegalStateException: File exists and there is no append 
> support!
>   at 
> org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.(WriteAheadLogWriter.scala:42)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$Su

[jira] [Created] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"

2014-12-10 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-4826:
-

 Summary: Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: 
"java.lang.IllegalStateException: File exists and there is no append support!"
 Key: SPARK-4826
 URL: https://issues.apache.org/jira/browse/SPARK-4826
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0, 1.3.0
Reporter: Josh Rosen
Assignee: Tathagata Das


I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite 
where four tests failed with the same exception.

[Link to test result (this will eventually 
break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/].
  In case that link breaks:

The failed tests:

{code}
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
available only in block manager, not in write ahead log
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
available only in write ahead log, not in block manager
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
available only in write ahead log, and test storing in block manager
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data with 
partially available in block manager, and rest in write ahead log
{code}

The error messages are all (essentially) the same:

{code}
 java.lang.IllegalStateException: File exists and there is no append 
support!
  at 
org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
  at 
org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34)
  at 
org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34)
  at 
org.apache.spark.streaming.util.WriteAheadLogWriter.(WriteAheadLogWriter.scala:42)
  at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140)
  at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95)
  at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67)
  at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
  at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
  at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
  at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
  at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
  at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
  at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
  at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
  at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
  at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
  at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
  at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
  at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
  at org.scalatest.Suite$class.run(Suite.scala:1424)
  at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
  at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
  at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
  at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
  at org.scalatest.FunSuiteLike$class.run(FunSuiteLi

[jira] [Updated] (SPARK-1600) flaky test case in streaming.CheckpointSuite

2014-12-10 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-1600:
--
Labels: flaky-test  (was: )

> flaky test case in streaming.CheckpointSuite
> 
>
> Key: SPARK-1600
> URL: https://issues.apache.org/jira/browse/SPARK-1600
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.2.0, 1.3.0
>Reporter: Nan Zhu
>  Labels: flaky-test
>
> The case "recovery with file input stream.recovery with file input stream" 
> sometimes fails when Jenkins is very busy, even for an unrelated change. 
> I have hit it 3 times and have also seen it in other places; 
> the latest example is in 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14397/
> where the modification is only in YARN-related files.
> I once reported it on the dev mailing list: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/a-weird-test-case-in-Streaming-td6116.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4790) Flaky test in ReceivedBlockTrackerSuite: "block addition, block to batch allocation, and cleanup with write ahead log"

2014-12-10 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4790:
--
Labels: flaky-test  (was: )

> Flaky test in ReceivedBlockTrackerSuite: "block addition, block to batch 
> allocation, and cleanup with write ahead log"
> --
>
> Key: SPARK-4790
> URL: https://issues.apache.org/jira/browse/SPARK-4790
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Assignee: Tathagata Das
>  Labels: flaky-test
>
> Found another flaky streaming test, 
> "org.apache.spark.streaming.ReceivedBlockTrackerSuite.block addition, block 
> to batch allocation and cleanup with write ahead log":
> {code}
> Error Message
> File /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist.
> Stacktrace
> sbt.ForkMain$ForkError: File 
> /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist.
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:324)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogSuite$.getLogFilesInDirectory(WriteAheadLogSuite.scala:344)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite.getWriteAheadLogFiles(ReceivedBlockTrackerSuite.scala:248)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply$mcV$sp(ReceivedBlockTrackerSuite.scala:173)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite.org$scalatest$BeforeAndAfter$$super$runTest(ReceivedBlockTrackerSuite.scala:41)
>   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite.runTest(ReceivedBlockTrackerSuite.scala:41)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite.org$scalatest$BeforeAndAfter$$super$run(ReceivedBlockTrackerSuite.scala:41)
>   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite.run(ReceivedBlockTrackerSuite.scala:41)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Fram

[jira] [Commented] (SPARK-4790) Flaky test in ReceivedBlockTrackerSuite: "block addition, block to batch allocation, and cleanup with write ahead log"

2014-12-10 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242243#comment-14242243
 ] 

Josh Rosen commented on SPARK-4790:
---

Curiously, it looks like this test hasn't failed recently in the pull request 
builder 
([link|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24351/testReport/junit/org.apache.spark.streaming/ReceivedBlockTrackerSuite/block_addition__block_to_batch_allocation_and_cleanup_with_write_ahead_log/history/]).
  However, it looks like it has failed on-and-off in the Maven builds 
([link|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1132/hadoop.version=1.0.4,label=centos/testReport/junit/org.apache.spark.streaming/ReceivedBlockTrackerSuite/block_addition__block_to_batch_allocation_and_cleanup_with_write_ahead_log/history/]).
  Perhaps the problem could be related to test suite initialization working 
differently in Maven than SBT?  Just a theory; maybe it actually _has_ failed 
in the PRB and the recent test history is just a lucky streak.

> Flaky test in ReceivedBlockTrackerSuite: "block addition, block to batch 
> allocation, and cleanup with write ahead log"
> --
>
> Key: SPARK-4790
> URL: https://issues.apache.org/jira/browse/SPARK-4790
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Assignee: Tathagata Das
>
> Found another flaky streaming test, 
> "org.apache.spark.streaming.ReceivedBlockTrackerSuite.block addition, block 
> to batch allocation and cleanup with write ahead log":
> {code}
> Error Message
> File /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist.
> Stacktrace
> sbt.ForkMain$ForkError: File 
> /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist.
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:324)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogSuite$.getLogFilesInDirectory(WriteAheadLogSuite.scala:344)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite.getWriteAheadLogFiles(ReceivedBlockTrackerSuite.scala:248)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply$mcV$sp(ReceivedBlockTrackerSuite.scala:173)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite.org$scalatest$BeforeAndAfter$$super$runTest(ReceivedBlockTrackerSuite.scala:41)
>   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
>   at 
> org.apache.spark.streaming.ReceivedBlockTrackerSuite.runTest(ReceivedBlockTrackerSuite.scala:41)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTe

[jira] [Comment Edited] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242177#comment-14242177
 ] 

宿荣全 edited comment on SPARK-4817 at 12/11/14 7:03 AM:
--

[~srowen]
One can always call {{foreachRDD}}, operate on the whole RDD, and then call 
{{take}} on the RDD to get a few elements to print. That achieves the effect, 
but it is more complicated.
{code:title=for example|borderStyle=solid}:
1.val dstream = stream.map->filter->..foreachRDD(rdd => {
  val array = rdd.collect
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
2.val dstream = stream.map->filter->foreachRDD(rdd => {
  val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, 
String)]) => iter.toArray)
  val array = Array.concat(rddarray: _*)
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
{code}
Both samples achieve the effect, but from a design perspective having streaming 
code manipulate the RDD directly is not good design, and I think {{foreachRDD}} 
is rarely used directly in application code. 
Streaming actions are generally registered through the following 5 methods, 
all of which call {{foreachRDD}}:
{color:blue}
# DStream.foreach
# DStream.saveAsObjectFiles
# DStream.saveAsTextFiles
# PairDStreamFunctions.saveAsHadoopFiles
# PairDStreamFunctions.saveAsNewAPIHadoopFiles
{color}


was (Author: surq):
[~srowen]
Always call {{foreachRDD}}, and operate on all of the RDD, and then call {{take 
}} on the RDD to get a few elements to print. It can achieve the effect, but it 
is more complicated.
for example:
1.val dstream = stream.map->filter->..foreachRDD(rdd => {
  val array = rdd.collect;
  var result = Array[(String,String)]();
  result = if (array.size > 5) array.take(5) else array.take(array.size);
  result foreach println;
})
2.val dstream = stream.map->filter->foreachRDD(rdd => {
  val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, 
String)]) => iter.toArray);
  val array = Array.concat(rddarray: _*);
  var result = Array[(String,String)]();
  result = if (array.size > 5) array.take(5) else array.take(array.size);
  result foreach println;
})
This two samples can achieve the effect. From the design perspective streaming 
direct manipulation of the RDD is not a good design.and I think the method 
{{foreachRDD}}  is generally not used in coding. 
Generally when streaming register action by  through the following 6 methods. 
Those methods all called method {{foreachRDD}}.
*
{color:blue}
# DStream.foreach
# DStream.saveAsObjectFiles
# DStream.saveAsTextFiles
# PairDStreamFunctions.saveAsHadoopFiles
# PairDStreamFunctions.saveAsNewAPIHadoopFiles
{color}
*

> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> The Dstream.print function prints 10 elements but handles only 11 elements.
> A new function based on Dstream.print is proposed:
> print the specified number of elements while handling all of the elements in the RDD.
> An example work scenario:
> val dstream = stream.map->filter->mapPartitions->print
> The data remaining after the filter must update a database in mapPartitions, but there
> is no need to print every record; printing the top 20 is enough to view the data processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242177#comment-14242177
 ] 

宿荣全 edited comment on SPARK-4817 at 12/11/14 6:59 AM:
--

[~srowen]
One can always call {{foreachRDD}}, operate on the whole RDD, and then call 
{{take}} on the RDD to get a few elements to print. That achieves the effect, 
but it is more complicated.
for example:
1.val dstream = stream.map->filter->..foreachRDD(rdd => {
  val array = rdd.collect;
  var result = Array[(String,String)]();
  result = if (array.size > 5) array.take(5) else array.take(array.size);
  result foreach println;
})
2.val dstream = stream.map->filter->foreachRDD(rdd => {
  val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, 
String)]) => iter.toArray);
  val array = Array.concat(rddarray: _*);
  var result = Array[(String,String)]();
  result = if (array.size > 5) array.take(5) else array.take(array.size);
  result foreach println;
})
Both samples achieve the effect, but from a design perspective having streaming 
code manipulate the RDD directly is not good design, and I think {{foreachRDD}} 
is rarely used directly in application code. 
Streaming actions are generally registered through the following 5 methods, 
all of which call {{foreachRDD}}:
*
{color:blue}
# DStream.foreach
# DStream.saveAsObjectFiles
# DStream.saveAsTextFiles
# PairDStreamFunctions.saveAsHadoopFiles
# PairDStreamFunctions.saveAsNewAPIHadoopFiles
{color}
*


was (Author: surq):
[~srowen]
Always call foreachRDD, and operate on all of the RDD, and then call take on 
the RDD to get a few elements to print. It can achieve the effect, but it is 
more complicated.
for example:
1.val dstream = stream.map->filter->..foreachRDD(rdd => {
  val array = rdd.collect

  var result = Array[(String,String)]()

  result = if (array.size > 5) array.take(5) else array.take(array.size)

  result foreach println
})
2.val dstream = stream.map->filter->foreachRDD(rdd => {
  val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, 
String)]) => iter.toArray)

  val array = Array.concat(rddarray: _*)

  var result = Array[(String,String)]()

  result = if (array.size > 5) array.take(5) else array.take(array.size)

  result foreach println
})
This two samples can achieve the effect. From the design perspective streaming 
direct manipulation of the RDD is not a good design.and I think the method 
'foreachRDD'  is generally not used in coding. 
Generally when streaming register action by  through the following 6 methods. 
Those methods all called method 'foreachRDD'.
1.DStream.foreach
2.DStream.saveAsObjectFiles
3.DStream.saveAsTextFiles
4.PairDStreamFunctions.saveAsHadoopFiles
5.PairDStreamFunctions.saveAsNewAPIHadoopFiles

> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> The Dstream.print function prints 10 elements but handles only 11 elements.
> A new function based on Dstream.print is proposed:
> print the specified number of elements while handling all of the elements in the RDD.
> An example work scenario:
> val dstream = stream.map->filter->mapPartitions->print
> The data remaining after the filter must update a database in mapPartitions, but there
> is no need to print every record; printing the top 20 is enough to view the data processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4700) Add Http support to Spark Thrift server

2014-12-10 Thread Judy Nash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242218#comment-14242218
 ] 

Judy Nash commented on SPARK-4700:
--

pull request created at https://github.com/apache/spark/pull/3672

> Add Http support to Spark Thrift server
> ---
>
> Key: SPARK-4700
> URL: https://issues.apache.org/jira/browse/SPARK-4700
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0, 1.2.1
> Environment: Linux and Windows
>Reporter: Judy Nash
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Currently the Thrift server only supports TCP connections. 
> This JIRA is to add HTTP support to the Spark Thrift server in addition to the 
> TCP protocol. Both TCP and HTTP are supported by Hive today. HTTP is more 
> secure and often used on Windows. 
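> A minimal client-side sketch, assuming the HTTP mode follows HiveServer2's 
> conventions (transportMode/httpPath in the JDBC URL); host, port and path are 
> placeholders:
> {code}
> import java.sql.DriverManager
>
> val url = "jdbc:hive2://localhost:10001/default;transportMode=http;httpPath=cliservice"
> val conn = DriverManager.getConnection(url, "user", "")
> val rs = conn.createStatement().executeQuery("SELECT 1")
> while (rs.next()) println(rs.getInt(1))   // prints 1 if the HTTP endpoint works
> conn.close()
> {code}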



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4700) Add Http support to Spark Thrift server

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242219#comment-14242219
 ] 

Apache Spark commented on SPARK-4700:
-

User 'judynash' has created a pull request for this issue:
https://github.com/apache/spark/pull/3672

> Add Http support to Spark Thrift server
> ---
>
> Key: SPARK-4700
> URL: https://issues.apache.org/jira/browse/SPARK-4700
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0, 1.2.1
> Environment: Linux and Windows
>Reporter: Judy Nash
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Currently the Thrift server only supports TCP connections. 
> This JIRA is to add HTTP support to the Spark Thrift server in addition to the 
> TCP protocol. Both TCP and HTTP are supported by Hive today. HTTP is more 
> secure and often used on Windows. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4700) Add Http support to Spark Thrift server

2014-12-10 Thread Judy Nash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Judy Nash updated SPARK-4700:
-
Affects Version/s: 1.3.0

> Add Http support to Spark Thrift server
> ---
>
> Key: SPARK-4700
> URL: https://issues.apache.org/jira/browse/SPARK-4700
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0, 1.2.1
> Environment: Linux and Windows
>Reporter: Judy Nash
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Currently the Thrift server only supports TCP connections. 
> This JIRA is to add HTTP support to the Spark Thrift server in addition to the 
> TCP protocol. Both TCP and HTTP are supported by Hive today. HTTP is more 
> secure and often used on Windows. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4720) Remainder should also return null if the divider is 0.

2014-12-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4720:

Target Version/s: 1.3.0  (was: 1.2.0)

> Remainder should also return null if the divider is 0.
> --
>
> Key: SPARK-4720
> URL: https://issues.apache.org/jira/browse/SPARK-4720
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> This is a follow-up of SPARK-4593.
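> A small sketch of the expected semantics (assuming a {{sqlContext}} is in 
> scope), mirroring what SPARK-4593 did for division:
> {code}
> sqlContext.sql("SELECT 5 / 0").collect()  // NULL after SPARK-4593
> sqlContext.sql("SELECT 5 % 0").collect()  // should also be NULL (this issue)
> {code}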



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4742) The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded

2014-12-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4742:

Target Version/s: 1.3.0  (was: 1.2.0)

> The name of Parquet File generated by AppendingParquetOutputFormat should be 
> zero padded
> 
>
> Key: SPARK-4742
> URL: https://issues.apache.org/jira/browse/SPARK-4742
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Sasaki Toru
>Priority: Minor
>
> When I use a Parquet file as an output file via 
> ParquetOutputFormat#getDefaultWorkFile, the file names are not zero padded, 
> while RDD#saveAsText does zero padding.
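> A minimal sketch of the desired naming, assuming the partition id is formatted 
> into the file name the same way saveAsText pads its part files:
> {code}
> val partition = 7
> // "part-r-00007.parquet" instead of "part-r-7.parquet"
> val padded = f"part-r-$partition%05d.parquet"
> {code}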



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4554) Set fair scheduler pool for JDBC client session in hive 13

2014-12-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4554.
-
Resolution: Duplicate

> Set fair scheduler pool for JDBC client session in hive 13
> --
>
> Key: SPARK-4554
> URL: https://issues.apache.org/jira/browse/SPARK-4554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
>Priority: Critical
> Fix For: 1.2.0
>
>
> Currently the Hive 13 shim does not support setting the fair scheduler pool. 
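> A minimal sketch of the intent, assuming an active SparkContext {{sc}} and 
> that the shim would set the standard scheduler-pool local property for the 
> session's statements ("poolName" is a placeholder):
> {code}
> // Spark's fair scheduler picks the pool from this thread-local property.
> sc.setLocalProperty("spark.scheduler.pool", "poolName")
> {code}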



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242177#comment-14242177
 ] 

宿荣全 edited comment on SPARK-4817 at 12/11/14 6:19 AM:
--

[~srowen]
One can always call foreachRDD, operate on the whole RDD, and then call take on 
the RDD to get a few elements to print. That achieves the effect, but it is 
more complicated.
for example:
1.val dstream = stream.map->filter->..foreachRDD(rdd => {
  val array = rdd.collect

  var result = Array[(String,String)]()

  result = if (array.size > 5) array.take(5) else array.take(array.size)

  result foreach println
})
2.val dstream = stream.map->filter->foreachRDD(rdd => {
  val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, 
String)]) => iter.toArray)

  val array = Array.concat(rddarray: _*)

  var result = Array[(String,String)]()

  result = if (array.size > 5) array.take(5) else array.take(array.size)

  result foreach println
})
Both samples achieve the effect, but from a design perspective having streaming 
code manipulate the RDD directly is not good design, and I think the method 
'foreachRDD' is rarely used directly in application code. 
Streaming actions are generally registered through the following 5 methods, 
all of which call 'foreachRDD':
1.DStream.foreach
2.DStream.saveAsObjectFiles
3.DStream.saveAsTextFiles
4.PairDStreamFunctions.saveAsHadoopFiles
5.PairDStreamFunctions.saveAsNewAPIHadoopFiles


was (Author: surq):
[~srowen]
Always call foreachRDD, and operate on all of the RDD, and then call take on 
the RDD to get a few elements to print. It can achieve the effect, but it is 
more complicated.
for example:
```
1.val dstream = stream.map->filter->..foreachRDD(rdd => {
  val array = rdd.collect
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
2.val dstream = stream.map->filter->foreachRDD(rdd => {
  val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, 
String)]) => iter.toArray)
  val array = Array.concat(rddarray: _*)
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
```
This two samples can achieve the effect. From the design perspective streaming 
direct manipulation of the RDD is not a good design.and I think the method 
'foreachRDD'  is generally not used in coding. 
Generally when streaming register action by  through the following 6 methods. 
Those methods all called method 'foreachRDD'.
1.DStream.foreach
2.DStream.saveAsObjectFiles
3.DStream.saveAsTextFiles
4.PairDStreamFunctions.saveAsHadoopFiles
5.PairDStreamFunctions.saveAsNewAPIHadoopFiles

> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> The Dstream.print function prints 10 elements but handles only 11 elements.
> A new function based on Dstream.print is proposed:
> print the specified number of elements while handling all of the elements in the RDD.
> An example work scenario:
> val dstream = stream.map->filter->mapPartitions->print
> The data remaining after the filter must update a database in mapPartitions, but there
> is no need to print every record; printing the top 20 is enough to view the data processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4554) Set fair scheduler pool for JDBC client session in hive 13

2014-12-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4554:

Priority: Critical  (was: Major)

> Set fair scheduler pool for JDBC client session in hive 13
> --
>
> Key: SPARK-4554
> URL: https://issues.apache.org/jira/browse/SPARK-4554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
>Priority: Critical
> Fix For: 1.2.0
>
>
> Currently the Hive 13 shim does not support setting the fair scheduler pool. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet

2014-12-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3575:

Target Version/s: 1.3.0  (was: 1.2.0)

> Hive Schema is ignored when using convertMetastoreParquet
> -
>
> Key: SPARK-3575
> URL: https://issues.apache.org/jira/browse/SPARK-3575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Critical
>
> This can cause problems when for example one of the columns is defined as 
> TINYINT.  A class cast exception will be thrown since the parquet table scan 
> produces INTs while the rest of the execution is expecting bytes.
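> A hedged repro sketch (table and column names are made up, {{hiveContext}} is 
> assumed to be a HiveContext): a Parquet-backed Hive table with a TINYINT 
> column; with the conversion enabled, a query over the byte column can hit the 
> class cast exception described above.
> {code}
> hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
> hiveContext.sql("CREATE TABLE tiny_parquet (b TINYINT) STORED AS PARQUET")
> hiveContext.sql("SELECT b FROM tiny_parquet").collect()  // may throw ClassCastException
> {code}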



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3860) Improve dimension joins

2014-12-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3860:

Target Version/s: 1.3.0  (was: 1.2.0)

> Improve dimension joins
> ---
>
> Key: SPARK-3860
> URL: https://issues.apache.org/jira/browse/SPARK-3860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> This is an umbrella ticket for improving performance for joining multiple 
> dimension tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4699) Make caseSensitive configurable in Analyzer.scala

2014-12-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4699:

Target Version/s: 1.3.0  (was: 1.2.0)

> Make caseSensitive configurable in Analyzer.scala
> -
>
> Key: SPARK-4699
> URL: https://issues.apache.org/jira/browse/SPARK-4699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Jacky Li
> Fix For: 1.2.0
>
>
> Currently, case sensitivity is true by default in Analyzer. It should be 
> configurable by setting SQLConf in the client application
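> A minimal sketch of the wiring (hypothetical, not the actual patch): read a 
> caseSensitive flag from configuration and use it to choose the attribute 
> resolution function.
> {code}
> val caseSensitive: Boolean = true  // would come from SQLConf in the real change
>
> val resolver: (String, String) => Boolean =
>   if (caseSensitive) (a, b) => a == b
>   else (a, b) => a.equalsIgnoreCase(b)
>
> resolver("Col1", "col1")  // false when case sensitive, true otherwise
> {code}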



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4811) Custom UDTFs not working in Spark SQL

2014-12-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4811:

Fix Version/s: (was: 1.2.0)

> Custom UDTFs not working in Spark SQL
> -
>
> Key: SPARK-4811
> URL: https://issues.apache.org/jira/browse/SPARK-4811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Saurabh Santhosh
>Priority: Critical
>
> I am using the Thrift server interface to Spark SQL and using beeline to 
> connect to it.
> I tried Spark SQL versions 1.1.0 and 1.1.1, and both throw the following 
> exception when using any custom UDTF.
> These are the steps I did:
> *Created a UDTF 'com.x.y.xxx'.*
> Registered the UDTF using the following query: 
> *create temporary function xxx as 'com.x.y.xxx'*
> The registration went through without any errors, but when I tried executing 
> the UDTF I got the following error:
> *java.lang.ClassNotFoundException: xxx*
> The funny thing is that it is trying to load the function name instead of the 
> function class. The exception is at *line no: 81 in hiveudfs.scala*.
> I have been at it for quite a long time.
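The same steps can also be driven from a HiveContext instead of beeline; a 
sketch (class, function, and table names follow the report and are 
placeholders):

{code}
// Sketch of the reported flow; 'com.x.y.xxx' and 'some_table' are placeholders.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc: an existing SparkContext
hiveContext.sql("create temporary function xxx as 'com.x.y.xxx'")  // succeeds
// Reported to fail with java.lang.ClassNotFoundException: xxx, i.e. the
// function *name* is looked up as a class instead of the registered class.
hiveContext.sql("SELECT xxx(value) FROM some_table")
{code}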



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3262) CREATE VIEW is not supported but the error message is not clear

2014-12-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3262.
-
   Resolution: Duplicate
Fix Version/s: 1.2.0

> CREATE VIEW is not supported but the error message is not clear
> ---
>
> Key: SPARK-3262
> URL: https://issues.apache.org/jira/browse/SPARK-3262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Xiangrui Meng
>Assignee: Michael Armbrust
> Fix For: 1.2.0
>
>
> In my case, it throws a "table not found" error. If this is not supported, we 
> should throw an "unsupported" error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread Yi Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242193#comment-14242193
 ] 

Yi Tian commented on SPARK-4817:


Hi, [~sowen]
I think the main idea of [~surq] is:
* For people who are just starting to work on Spark Streaming (just like 
[~surq]), the {{print}} API does not make it clear that only 10 lines of data 
will be handled. 
* In some cases, developers need an API that handles all of the data but prints 
only the first few lines; see the sketch below.
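As a rough illustration of that second point, here is a minimal sketch (not an 
existing DStream API at the time; the helper name and the runJob-based approach 
are only assumptions) of how "process every element, print only a prefix" can 
be expressed with the public primitives discussed in this thread:

{code}
// Sketch only: drain every partition so that upstream side effects see all
// elements, but bring back and print just the first `num` records.
import scala.reflect.ClassTag
import org.apache.spark.streaming.dstream.DStream

def processAllPrintSome[T: ClassTag](dstream: DStream[T], num: Int): Unit = {
  dstream.foreachRDD { rdd =>
    val heads = rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => {
      val all = iter.toArray   // consume the whole partition
      all.take(num)            // keep only a prefix to send to the driver
    })
    heads.flatten.take(num).foreach(println)
  }
}
{code}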


> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> The DStream.print function prints 10 elements but handles 11 elements.
> A new function based on DStream.print is proposed: print a specified number 
> of elements while handling all of the elements in the RDD.
> An example use case:
> val dstream = stream.map->filter->mapPartitions->print
> The data remaining after the filter needs to update a database in 
> mapPartitions, but there is no need to print every record; printing the first 
> 20 is enough to inspect the data processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4825) CTAS fails to resolve when created using saveAsTable

2014-12-10 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-4825:
---

 Summary: CTAS fails to resolve when created using saveAsTable
 Key: SPARK-4825
 URL: https://issues.apache.org/jira/browse/SPARK-4825
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Michael Armbrust
Assignee: Cheng Hao
Priority: Critical


While writing a test for a different issue, I found that saveAsTable seems to 
be broken:

{code}
  test("save join to table") {
val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, 
i.toString))
sql("CREATE TABLE test1 (key INT, value STRING)")
testData.insertInto("test1")
sql("CREATE TABLE test2 (key INT, value STRING)")
testData.insertInto("test2")
testData.insertInto("test2")
sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = 
b.key").saveAsTable("test")
checkAnswer(
  table("test"),
  sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = 
b.key").collect().toSeq)
  }
​
sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = 
b.key").saveAsTable("test")
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved plan 
found, tree:
'CreateTableAsSelect None, test, false, None
 Aggregate [], [COUNT(value#336) AS _c0#334L]
  Join Inner, Some((key#335 = key#339))
   MetastoreRelation default, test1, Some(a)
   MetastoreRelation default, test2, Some(b)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242177#comment-14242177
 ] 

宿荣全 edited comment on SPARK-4817 at 12/11/14 5:57 AM:
--

[~srowen]
One can always call foreachRDD, operate on all of the RDD, and then call take 
on the RDD to get a few elements to print. That achieves the effect, but it is 
more complicated.
For example:
```
1.val dstream = stream.map->filter->..foreachRDD(rdd => {
  val array = rdd.collect
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
2.val dstream = stream.map->filter->foreachRDD(rdd => {
  val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, 
String)]) => iter.toArray)
  val array = Array.concat(rddarray: _*)
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
```
These two samples achieve the effect, but from a design perspective, 
manipulating the RDD directly in streaming code is not a good design, and I 
think the 'foreachRDD' method is generally not used in application code.
Generally, a streaming application registers actions through the following 
methods, all of which call 'foreachRDD':
1.DStream.foreach
2.DStream.saveAsObjectFiles
3.DStream.saveAsTextFiles
4.PairDStreamFunctions.saveAsHadoopFiles
5.PairDStreamFunctions.saveAsNewAPIHadoopFiles


was (Author: surq):
[~srowen]
Always call foreachRDD, and operate on all of the RDD, and then call take on 
the RDD to get a few elements to print.It can achieve the effect, but it is 
more complicated.
for example:
1.val dstream = stream.map->filter->..foreachRDD(rdd => {
  val array = rdd.collect
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
2.val dstream = stream.map->filter->foreachRDD(rdd => {
  val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, 
String)]) => iter.toArray)
  val array = Array.concat(rddarray: _*)
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
this two samples can achieve the effect. From the design perspective streaming 
direct manipulation of the RDD is not a good design.and I thank the method 
'foreachRDD'  is generally not used in coding. 
Generally when streaming register action by  through the following 6 
methods.Those methods all called method 'foreachRDD'.
1.DStream.foreach
2.DStream.saveAsObjectFiles
3.DStream.saveAsTextFiles
4.PairDStreamFunctions.saveAsHadoopFiles
5.PairDStreamFunctions.saveAsNewAPIHadoopFiles

> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> The DStream.print function prints 10 elements but handles 11 elements.
> A new function based on DStream.print is proposed: print a specified number 
> of elements while handling all of the elements in the RDD.
> An example use case:
> val dstream = stream.map->filter->mapPartitions->print
> The data remaining after the filter needs to update a database in 
> mapPartitions, but there is no need to print every record; printing the first 
> 20 is enough to inspect the data processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242177#comment-14242177
 ] 

宿荣全 edited comment on SPARK-4817 at 12/11/14 5:35 AM:
--

[~srowen]
One can always call foreachRDD, operate on all of the RDD, and then call take 
on the RDD to get a few elements to print. That achieves the effect, but it is 
more complicated.
For example:
1.val dstream = stream.map->filter->..foreachRDD(rdd => {
  val array = rdd.collect
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
2.val dstream = stream.map->filter->foreachRDD(rdd => {
  val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, 
String)]) => iter.toArray)
  val array = Array.concat(rddarray: _*)
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
These two samples achieve the effect, but from a design perspective, 
manipulating the RDD directly in streaming code is not a good design, and I 
think the 'foreachRDD' method is generally not used in application code.
Generally, a streaming application registers actions through the following 
methods, all of which call 'foreachRDD':
1.DStream.foreach
2.DStream.saveAsObjectFiles
3.DStream.saveAsTextFiles
4.PairDStreamFunctions.saveAsHadoopFiles
5.PairDStreamFunctions.saveAsNewAPIHadoopFiles


was (Author: surq):
Always call foreachRDD, and operate on all of the RDD, and then call take on 
the RDD to get a few elements to print.It can achieve the effect, but it is 
more complicated.
for example:
1.val dstream = stream.map->filter->..foreachRDD(rdd => {
  val array = rdd.collect
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
2.val dstream = stream.map->filter->foreachRDD(rdd => {
  val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, 
String)]) => iter.toArray)
  val array = Array.concat(rddarray: _*)
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
this two samples can achieve the effect. From the design perspective streaming 
direct manipulation of the RDD is not a good design.and I thank the method 
'foreachRDD'  is generally not used in coding. 
Generally when streaming register action by  through the following 6 
methods.Those methods all called method 'foreachRDD'.
①.DStream.foreach
②.DStream.saveAsObjectFiles
③.DStream.saveAsTextFiles
④.PairDStreamFunctions.saveAsHadoopFiles
⑤.PairDStreamFunctions.saveAsNewAPIHadoopFiles

> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> The DStream.print function prints 10 elements but handles 11 elements.
> A new function based on DStream.print is proposed: print a specified number 
> of elements while handling all of the elements in the RDD.
> An example use case:
> val dstream = stream.map->filter->mapPartitions->print
> The data remaining after the filter needs to update a database in 
> mapPartitions, but there is no need to print every record; printing the first 
> 20 is enough to inspect the data processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242177#comment-14242177
 ] 

宿荣全 commented on SPARK-4817:


One can always call foreachRDD, operate on all of the RDD, and then call take 
on the RDD to get a few elements to print. That achieves the effect, but it is 
more complicated.
For example:
1.val dstream = stream.map->filter->..foreachRDD(rdd => {
  val array = rdd.collect
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
2.val dstream = stream.map->filter->foreachRDD(rdd => {
  val rddarray = ssc.sparkContext.runJob(rdd, (iter: Iterator[(String, 
String)]) => iter.toArray)
  val array = Array.concat(rddarray: _*)
  var result = Array[(String,String)]()
  result = if (array.size > 5) array.take(5) else array.take(array.size)
  result foreach println
})
These two samples achieve the effect, but from a design perspective, 
manipulating the RDD directly in streaming code is not a good design, and I 
think the 'foreachRDD' method is generally not used in application code.
Generally, a streaming application registers actions through the following 
methods, all of which call 'foreachRDD':
①.DStream.foreach
②.DStream.saveAsObjectFiles
③.DStream.saveAsTextFiles
④.PairDStreamFunctions.saveAsHadoopFiles
⑤.PairDStreamFunctions.saveAsNewAPIHadoopFiles

> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> The DStream.print function prints 10 elements but handles 11 elements.
> A new function based on DStream.print is proposed: print a specified number 
> of elements while handling all of the elements in the RDD.
> An example use case:
> val dstream = stream.map->filter->mapPartitions->print
> The data remaining after the filter needs to update a database in 
> mapPartitions, but there is no need to print every record; printing the first 
> 20 is enough to inspect the data processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4700) Add Http support to Spark Thrift server

2014-12-10 Thread Judy Nash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Judy Nash updated SPARK-4700:
-
Description: 
Currently the Thrift server only supports TCP connections. 

This JIRA is to add HTTP support to the Spark Thrift server in addition to the 
TCP protocol. Both TCP and HTTP are supported by Hive today. HTTP is more 
secure and is often used on Windows. 

  was:
Currently thrift only supports TCP connection. 

The ask is to add HTTP connection as well. Both TCP and HTTP are supported by 
Hive today. 


> Add Http support to Spark Thrift server
> ---
>
> Key: SPARK-4700
> URL: https://issues.apache.org/jira/browse/SPARK-4700
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.2.1
> Environment: Linux and Windows
>Reporter: Judy Nash
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Currently the Thrift server only supports TCP connections. 
> This JIRA is to add HTTP support to the Spark Thrift server in addition to the 
> TCP protocol. Both TCP and HTTP are supported by Hive today. HTTP is more 
> secure and is often used on Windows. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey

2014-12-10 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242101#comment-14242101
 ] 

Reynold Xin commented on SPARK-4740:


I can't really think of a reason why the Netty one would perform worse than 
the old NIO one (maybe there is one, but I can't think of it). Can you take a 
look at the I/O wait time to see if it is high? If not, then it is probably 
not seek time, right?

> Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
> 
>
> Key: SPARK-4740
> URL: https://issues.apache.org/jira/browse/SPARK-4740
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: Zhang, Liye
>Assignee: Reynold Xin
> Attachments: (rxin patch better executor)TestRunner  sort-by-key - 
> Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner  
> sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 
> 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner  
> sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, 
> TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per 
> node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z
>
>
> When testing current spark master (1.3.0-snapshot) with spark-perf 
> (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService 
> takes much longer time than NIO based shuffle transferService. The network 
> throughput of Netty is only about half of that of NIO. 
> We tested with standalone mode, and the data set we used for test is 20 
> billion records, and the total size is about 400GB. Spark-perf test is 
> Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each 
> executor memory is 64GB. The reduce tasks number is set to 1000. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242090#comment-14242090
 ] 

宿荣全 commented on SPARK-4817:


 [~srowen]
'Neither prints the "top" elements. Did you mean "first"?'
Yes, it prints the first 'num' records.

Both print and foreachRDD ultimately call {new ForEachDStream(this, 
context.sparkContext.clean(foreachFunc)).register()}.
The difference is that print's 'foreachFunc' is defined by Spark Streaming, 
while foreachRDD's 'foreachFunc' is defined by the developer. I think the 
suggested approach of "always call foreachRDD, operate on all of the RDD, and 
then call take on the RDD to get a few elements to print" behaves the same as 
print: it only prints, and does not handle all of the elements in the RDD.
for example:
1.val dstream = stream.map->filter->.foreachRDD(rdd => {
  val result = rdd.take(11)
  result foreach println
})
2.val dstream = stream.map->filter->print
Both of these examples handle 11 records: example 1 prints 11 records, while 
example 2 prints 10 records plus "...".

So if we want to handle all of the elements in the RDD and print only 'num' 
records, I think this patch is very convenient and necessary.



> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> The DStream.print function prints 10 elements but handles 11 elements.
> A new function based on DStream.print is proposed: print a specified number 
> of elements while handling all of the elements in the RDD.
> An example use case:
> val dstream = stream.map->filter->mapPartitions->print
> The data remaining after the filter needs to update a database in 
> mapPartitions, but there is no need to print every record; printing the first 
> 20 is enough to inspect the data processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4687) SparkContext#addFile doesn't keep file folder information

2014-12-10 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242086#comment-14242086
 ] 

Xuefu Zhang commented on SPARK-4687:


I concur with [~sandyr]'s account of the need for addFolder(), as it saves any 
Spark application from doing the same thing over and over again, possibly in a 
less performant way. A folder is such a natural, and to some extent 
indispensable, extension of a file in any system that deals with bits at the 
storage level. I'd also contend that being hard to get right shouldn't prevent 
us from trying, and perfecting it along the way, if we believe it is 
functionally the right thing to add.

> SparkContext#addFile doesn't keep file folder information
> -
>
> Key: SPARK-4687
> URL: https://issues.apache.org/jira/browse/SPARK-4687
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Jimmy Xiang
>
> Files added with SparkContext#addFile are loaded with Utils#fetchFile before 
> a task starts. However, Utils#fetchFile puts all files under the Spark root 
> on the worker node. We should have an option to keep the folder information. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4818) Join operation should use iterator/lazy evaluation

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242064#comment-14242064
 ] 

Apache Spark commented on SPARK-4818:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/3671

> Join operation should use iterator/lazy evaluation
> --
>
> Key: SPARK-4818
> URL: https://issues.apache.org/jira/browse/SPARK-4818
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.1
>Reporter: Johannes Simon
>Priority: Minor
>
> The current implementation of the join operation does not use an iterator 
> (i.e. lazy evaluation), causing it to explicitly evaluate the co-grouped 
> values. In big data applications, these value collections can be very large. 
> This causes the *cartesian product of all co-grouped values* for a specific 
> key of both RDDs to be kept in memory during the flatMapValues operation, 
> resulting in an *O(size(pair._1)*size(pair._2))* memory consumption instead 
> of *O(1)*. Very large value collections will therefore cause "GC overhead 
> limit exceeded" exceptions and fail the task, or at least slow down execution 
> dramatically.
> {code:title=PairRDDFunctions.scala|borderStyle=solid}
> //...
> def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = 
> {
>   this.cogroup(other, partitioner).flatMapValues( pair =>
> for (v <- pair._1; w <- pair._2) yield (v, w)
>   )
> }
> //...
> {code}
> Since cogroup returns an Iterable instance of an Array, the join 
> implementation could be changed to the following, which uses lazy evaluation 
> instead, and has almost no memory overhead:
> {code:title=PairRDDFunctions.scala|borderStyle=solid}
> //...
> def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = 
> {
>   this.cogroup(other, partitioner).flatMapValues( pair =>
> for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
>   )
> }
> //...
> {code}
> Alternatively, if the current implementation is intentionally not using lazy 
> evaluation for some reason, there could be a *lazyJoin()* method next to the 
> original join implementation that utilizes lazy evaluation. This of course 
> applies to other join operations as well.
> Thanks! :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4824) Join should use `Iterator` rather than `Iterable`

2014-12-10 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu closed SPARK-4824.
---
Resolution: Duplicate

> Join should use `Iterator` rather than `Iterable`
> -
>
> Key: SPARK-4824
> URL: https://issues.apache.org/jira/browse/SPARK-4824
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> In Scala, `map` and `flatMap` of `Iterable` will copy the contents of 
> `Iterable` to a new `Seq`. Such as,
> {code}
>   val iterable = Seq(1, 2, 3).map(v => {
> println(v)
> v
>   })
>   println("Iterable map done")
>   val iterator = Seq(1, 2, 3).iterator.map(v => {
> println(v)
> v
>   })
>   println("Iterator map done")
> {code}
> outputs
> {code}
> 1
> 2
> 3
> Iterable map done
> Iterator map done
> {code}
> So we should use 'iterator' to reduce memory consumed by join.
> Found by [~johannes.simon]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4824) Join should use `Iterator` rather than `Iterable`

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242056#comment-14242056
 ] 

Apache Spark commented on SPARK-4824:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/3671

> Join should use `Iterator` rather than `Iterable`
> -
>
> Key: SPARK-4824
> URL: https://issues.apache.org/jira/browse/SPARK-4824
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> In Scala, `map` and `flatMap` of `Iterable` will copy the contents of 
> `Iterable` to a new `Seq`. Such as,
> {code}
>   val iterable = Seq(1, 2, 3).map(v => {
> println(v)
> v
>   })
>   println("Iterable map done")
>   val iterator = Seq(1, 2, 3).iterator.map(v => {
> println(v)
> v
>   })
>   println("Iterator map done")
> {code}
> outputs
> {code}
> 1
> 2
> 3
> Iterable map done
> Iterator map done
> {code}
> So we should use 'iterator' to reduce memory consumed by join.
> Found by [~johannes.simon]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4687) SparkContext#addFile doesn't keep file folder information

2014-12-10 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242055#comment-14242055
 ] 

Sandy Ryza commented on SPARK-4687:
---

I think [~xuefuz] can probably motivate this better, but from what I 
understand, the main use case is Hive's map joins and map bucket joins, in 
which a smaller table needs to be distributed to every node.  The smaller table 
typically resides in HDFS, and is the output of a separate job.  For map joins, 
the smaller table is composed of a bunch of files in a single folder.  For map 
bucket joins, the smaller table is composed of a single folder with a bunch of 
bucket folders underneath, each containing a set of data files.  At the very 
least, doing the prefixing would require a bunch of extra FS operations to 
rename all the subfiles.  Though that might make them difficult to read from 
other Hive implementations?

Another totally separate situation I encountered a while ago where this kind of 
thing would have been useful was calling http://ctakes.apache.org/ in a 
distributed fashion.  Calling into it requires letting it load a bunch of files 
from a particular directory structure.  We ultimately had to go with a 
workaround that required installing the directory on every node.

Beyond the issues I outlined in my patch, are there particular edge cases 
you're worried about where we wouldn't be able to copy the behavior from 
addFile?


> SparkContext#addFile doesn't keep file folder information
> -
>
> Key: SPARK-4687
> URL: https://issues.apache.org/jira/browse/SPARK-4687
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Jimmy Xiang
>
> Files added with SparkContext#addFile are loaded with Utils#fetchFile before 
> a task starts. However, Utils#fetchFile puts all files under the Spark root 
> on the worker node. We should have an option to keep the folder information. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4824) Join should use `Iterator` rather than `Iterable`

2014-12-10 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-4824:
---

 Summary: Join should use `Iterator` rather than `Iterable`
 Key: SPARK-4824
 URL: https://issues.apache.org/jira/browse/SPARK-4824
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Shixiong Zhu


In Scala, `map` and `flatMap` of `Iterable` will copy the contents of 
`Iterable` to a new `Seq`. Such as,
{code}
  val iterable = Seq(1, 2, 3).map(v => {
println(v)
v
  })
  println("Iterable map done")

  val iterator = Seq(1, 2, 3).iterator.map(v => {
println(v)
v
  })
  println("Iterator map done")
{code}
outputs
{code}
1
2
3
Iterable map done
Iterator map done
{code}
So we should use 'iterator' to reduce memory consumed by join.

Found by [~johannes.simon]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel

2014-12-10 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242034#comment-14242034
 ] 

Debasish Das commented on SPARK-4675:
-

[~josephkb] how do we validate that the low-dimensional space gives more 
meaningful similarities than the (sparse) feature space?

> Find similar products and similar users in MatrixFactorizationModel
> ---
>
> Key: SPARK-4675
> URL: https://issues.apache.org/jira/browse/SPARK-4675
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Steven Bourke
>Priority: Trivial
>  Labels: mllib, recommender
>
> Using the latent feature space that is learnt in MatrixFactorizationModel, I 
> have added 2 new functions to find similar products and similar users. A user 
> of the API can for example pass a product ID, and get the closest products. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4823) rowSimilarities

2014-12-10 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242031#comment-14242031
 ] 

Debasish Das commented on SPARK-4823:
-

I am considering coming up with a baseline version that is very close to brute 
force, but where we cut the computation with a topK number: for each user, come 
up with the topK most similar users, where K is defined by the client. This 
will take care of the matrix factorization use case.

Basically, on the master we collect the set of user factors, broadcast it to 
every node, and do a reduceByKey to generate the topK users for each user from 
this user block. We pass a kernel function (cosine / polynomial / RBF) into 
this calculation.

But this idea does not work for raw features, right? If we map the features to 
a lower dimension using factorization, then this approach should run fine, but 
I am not sure we can ask users to map their data into a lower dimension.

Is it possible to bring in ideas from Fastfood and random kitchen sinks to do this?
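A rough sketch of that baseline (hypothetical names; it keeps only the cosine 
kernel and skips the reduceByKey / user-block structure described above):

{code}
// Brute force with a topK cutoff: broadcast all user factors, then emit the
// k most similar users for every user. Feasible only while the collected
// factors fit in memory on each node.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object TopKUserSimilarity extends Serializable {
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    var dot = 0.0; var na = 0.0; var nb = 0.0; var i = 0
    while (i < a.length) {
      dot += a(i) * b(i); na += a(i) * a(i); nb += b(i) * b(i); i += 1
    }
    dot / math.sqrt(na * nb)
  }

  def topK(sc: SparkContext,
           userFactors: RDD[(Int, Array[Double])],
           k: Int): RDD[(Int, Array[(Int, Double)])] = {
    val all = sc.broadcast(userFactors.collect())
    userFactors.map { case (uid, f) =>
      val sims = all.value
        .filter { case (vid, _) => vid != uid }      // skip self-similarity
        .map { case (vid, g) => (vid, cosine(f, g)) }
        .sortBy(-_._2)
        .take(k)
      (uid, sims)
    }
  }
}
{code}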


> rowSimilarities
> ---
>
> Key: SPARK-4823
> URL: https://issues.apache.org/jira/browse/SPARK-4823
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Reza Zadeh
>
> RowMatrix has a columnSimilarities method to find cosine similarities between 
> columns.
> A rowSimilarities method would be useful to find similarities between rows.
> This JIRA is to investigate which algorithms are suitable for such a 
> method, better than brute-forcing it. Note that when there are many rows (> 
> 10^6), it is unlikely that brute-force will be feasible, since the output 
> will be of order 10^12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger

2014-12-10 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242030#comment-14242030
 ] 

Michael Armbrust commented on SPARK-4814:
-

Hive throws a lot of warnings here for some reason, even when reading data from 
its own test files. One option would be to override the configuration only for 
the hive subproject.

{code}
// Hive throws assertion errors for tests that pass.
javaOptions in Test := (javaOptions in Test).value.filterNot(_ == "-ea"),
{code}

> Enable assertions in SBT, Maven tests / AssertionError from Hive's 
> LazyBinaryInteger
> 
>
> Key: SPARK-4814
> URL: https://issues.apache.org/jira/browse/SPARK-4814
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.1.0
>Reporter: Sean Owen
>
> Follow up to SPARK-4159, wherein we noticed that Java tests weren't running 
> in Maven, in part because a Java test actually fails with {{AssertionError}}. 
> That code/test was fixed in SPARK-4850.
> The reason it wasn't caught by SBT tests was that they don't run with 
> assertions on, and Maven's surefire does.
> Turning on assertions in the SBT build is trivial, adding one line:
> {code}
> javaOptions in Test += "-ea",
> {code}
> This reveals a test failure in Scala test suites though:
> {code}
> [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds)
> [info]   Failed to execute query using catalyst:
> [info]   Error: Job aborted due to stage failure: Task 1 in stage 551.0 
> failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, 
> localhost): java.lang.AssertionError
> [info]at 
> org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51)
> [info]at 
> org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110)
> [info]at 
> org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171)
> [info]at 
> org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166)
> [info]at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318)
> [info]at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314)
> [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> [info]at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132)
> [info]at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
> [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
> [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
> [info]at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> [info]at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
> [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
> [info]at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> [info]at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
> [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
> [info]at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> [info]at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> [info]at org.apache.spark.scheduler.Task.run(Task.scala:56)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> [info]at java.lang.Thread.run(Thread.java:745)
> {code}
> The items for this JIRA are therefore:
> - Enable assertions in SBT
> - Fix this failure
> - Figure out why Maven scalatest didn't trigger it - may need assertions 
> explicitly turned on too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4823) rowSimilarities

2014-12-10 Thread Reza Zadeh (JIRA)
Reza Zadeh created SPARK-4823:
-

 Summary: rowSimilarities
 Key: SPARK-4823
 URL: https://issues.apache.org/jira/browse/SPARK-4823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh


RowMatrix has a columnSimilarities method to find cosine similarities between 
columns.

A rowSimilarities method would be useful to find similarities between rows.

This JIRA is to investigate which algorithms are suitable for such a method, 
better than brute-forcing it. Note that when there are many rows (> 10^6), it 
is unlikely that brute-force will be feasible, since the output will be of 
order 10^12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger

2014-12-10 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241983#comment-14241983
 ] 

Cheng Lian commented on SPARK-4814:
---

Thanks [~srowen]! I'll take a look.

> Enable assertions in SBT, Maven tests / AssertionError from Hive's 
> LazyBinaryInteger
> 
>
> Key: SPARK-4814
> URL: https://issues.apache.org/jira/browse/SPARK-4814
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.1.0
>Reporter: Sean Owen
>
> Follow up to SPARK-4159, wherein we noticed that Java tests weren't running 
> in Maven, in part because a Java test actually fails with {{AssertionError}}. 
> That code/test was fixed in SPARK-4850.
> The reason it wasn't caught by SBT tests was that they don't run with 
> assertions on, and Maven's surefire does.
> Turning on assertions in the SBT build is trivial, adding one line:
> {code}
> javaOptions in Test += "-ea",
> {code}
> This reveals a test failure in Scala test suites though:
> {code}
> [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds)
> [info]   Failed to execute query using catalyst:
> [info]   Error: Job aborted due to stage failure: Task 1 in stage 551.0 
> failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, 
> localhost): java.lang.AssertionError
> [info]at 
> org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51)
> [info]at 
> org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110)
> [info]at 
> org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171)
> [info]at 
> org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166)
> [info]at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318)
> [info]at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314)
> [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> [info]at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132)
> [info]at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
> [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
> [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
> [info]at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> [info]at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
> [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
> [info]at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> [info]at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
> [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
> [info]at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> [info]at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> [info]at org.apache.spark.scheduler.Task.run(Task.scala:56)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> [info]at java.lang.Thread.run(Thread.java:745)
> {code}
> The items for this JIRA are therefore:
> - Enable assertions in SBT
> - Fix this failure
> - Figure out why Maven scalatest didn't trigger it - may need assertions 
> explicitly turned on too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey

2014-12-10 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241973#comment-14241973
 ] 

Saisai Shao edited comment on SPARK-4740 at 12/11/14 1:34 AM:
--

Hi Reynold, the code I pasted is just an example; I do convert the ByteBuffer 
to a Netty ByteBuf. We will test again using your patch to see the difference.

Since the performance of Netty was similar to NIO when we tested with a RAM 
disk to minimize the disk effect, can we say that the current implementation of 
the Netty transfer service is not well tuned for slow random-IO devices like 
spinning disks? Also, from my understanding, I guess the unbalanced situation 
might be introduced by random disk IO latency: SSDs and RAM disks outperform 
spinning disks by a wide margin on random seeks, which minimizes the variance 
in IO latency, and the only difference in system configuration before and after 
was switching from spinning disk to RAM disk.

So what is your opinion?


was (Author: jerryshao):
Hi Reynold, the code I pasted is just the example, I do convert ByteBuffer to 
Netty ByteBuf. We will test again using your patch to see the difference.

As we tested using ram disk to minimize the disk effect, the performance of 
Netty is similar to NIO, can we say that current implementation of Netty 
transfer service is not well tuned for slow random IO device like spinned disk. 
Also from my understanding I guess that the unbalanced situation might be 
introduced by random disk IO latency, since SSDs and ramdisk outperforms much 
better than spinned disk in random seek, which minimize the IO latency 
vibrance, and the difference of system configuration before and after is only 
the spinned disk to ram disk.

So what is your opinion?

> Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
> 
>
> Key: SPARK-4740
> URL: https://issues.apache.org/jira/browse/SPARK-4740
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: Zhang, Liye
>Assignee: Reynold Xin
> Attachments: (rxin patch better executor)TestRunner  sort-by-key - 
> Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner  
> sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 
> 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner  
> sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, 
> TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per 
> node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z
>
>
> When testing current spark master (1.3.0-snapshot) with spark-perf 
> (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService 
> takes much longer time than NIO based shuffle transferService. The network 
> throughput of Netty is only about half of that of NIO. 
> We tested with standalone mode, and the data set we used for test is 20 
> billion records, and the total size is about 400GB. Spark-perf test is 
> Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each 
> executor memory is 64GB. The reduce tasks number is set to 1000. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey

2014-12-10 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241973#comment-14241973
 ] 

Saisai Shao commented on SPARK-4740:


Hi Reynold, the code I pasted is just an example; I do convert the ByteBuffer 
to a Netty ByteBuf. We will test again using your patch to see the difference.

Since the performance of Netty was similar to NIO when we tested with a RAM 
disk to minimize the disk effect, can we say that the current implementation of 
the Netty transfer service is not well tuned for slow random-IO devices like 
spinning disks? Also, from my understanding, I guess the unbalanced situation 
might be introduced by random disk IO latency: SSDs and RAM disks outperform 
spinning disks by a wide margin on random seeks, which minimizes the variance 
in IO latency, and the only difference in system configuration before and after 
was switching from spinning disk to RAM disk.

So what is your opinion?

> Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
> 
>
> Key: SPARK-4740
> URL: https://issues.apache.org/jira/browse/SPARK-4740
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: Zhang, Liye
>Assignee: Reynold Xin
> Attachments: (rxin patch better executor)TestRunner  sort-by-key - 
> Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner  
> sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 
> 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner  
> sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, 
> TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per 
> node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z
>
>
> When testing current spark master (1.3.0-snapshot) with spark-perf 
> (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService 
> takes much longer time than NIO based shuffle transferService. The network 
> throughput of Netty is only about half of that of NIO. 
> We tested with standalone mode, and the data set we used for test is 20 
> billion records, and the total size is about 400GB. Spark-perf test is 
> Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each 
> executor memory is 64GB. The reduce tasks number is set to 1000. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4687) SparkContext#addFile doesn't keep file folder information

2014-12-10 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241974#comment-14241974
 ] 

Patrick Wendell commented on SPARK-4687:


I commented a bit on the JIRA after seeing this. Instead of having this can 
Hive just add individual files with a prefix denoting what the folder is? 
Maintaining addFile has been nontrivial (we found a lot of corner cases over 
time) so it would be good to understand what the alternative is for Hive.

> SparkContext#addFile doesn't keep file folder information
> -
>
> Key: SPARK-4687
> URL: https://issues.apache.org/jira/browse/SPARK-4687
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Jimmy Xiang
>
> Files added with SparkContext#addFile are loaded with Utils#fetchFile before 
> a task starts. However, Utils#fetchFile puts all files under the Spark root 
> on the worker node. We should have an option to keep the folder information. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4687) SparkContext#addFile doesn't keep file folder information

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241961#comment-14241961
 ] 

Apache Spark commented on SPARK-4687:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/3670

> SparkContext#addFile doesn't keep file folder information
> -
>
> Key: SPARK-4687
> URL: https://issues.apache.org/jira/browse/SPARK-4687
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Jimmy Xiang
>
> Files added with SparkContext#addFile are loaded with Utils#fetchFile before 
> a task starts. However, Utils#fetchFile puts all files under the Spark root 
> on the worker node. We should have an option to keep the folder information. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel

2014-12-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241952#comment-14241952
 ] 

Joseph K. Bradley commented on SPARK-4675:
--

Just to make sure I get your last question, are you asking, "Why compute 
product similarities using the low-dimensional space when we could do it in the 
high-dimensional space?"  If so, then my understanding is that the 
low-dimensional space will give more meaningful similarities in general.
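For concreteness, "similarities in the low-dimensional space" means comparing 
the learned factor vectors directly; a sketch (not the API proposed in this 
ticket, and the cosine choice is just one option):

{code}
// Find the k products closest to a given product in the latent space of a
// trained MatrixFactorizationModel (names are illustrative).
import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

def similarProducts(model: MatrixFactorizationModel,
                    productId: Int,
                    k: Int): Array[(Int, Double)] = {
  val target = model.productFeatures.lookup(productId).head
  val cosine = (a: Array[Double], b: Array[Double]) => {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))
  }
  model.productFeatures
    .filter { case (id, _) => id != productId }        // skip the query product
    .map { case (id, f) => (id, cosine(target, f)) }
    .top(k)(Ordering.by[(Int, Double), Double](_._2))  // k highest similarities
}
{code}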

> Find similar products and similar users in MatrixFactorizationModel
> ---
>
> Key: SPARK-4675
> URL: https://issues.apache.org/jira/browse/SPARK-4675
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Steven Bourke
>Priority: Trivial
>  Labels: mllib, recommender
>
> Using the latent feature space that is learnt in MatrixFactorizationModel, I 
> have added 2 new functions to find similar products and similar users. A user 
> of the API can for example pass a product ID, and get the closest products. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped

2014-12-10 Thread Ilayaperumal Gopinathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241951#comment-14241951
 ] 

Ilayaperumal Gopinathan commented on SPARK-2892:


To add more info:

When the ReceiverTracker sends the "StopReceiver" message to the receiver actor 
on the executor, its ReceiverLauncher thread always times out, and I notice that 
the corresponding job is cancelled only because the DAGScheduler is being 
stopped. This throws exception [1], while on the executor side the worker node 
logs info [2].

Exception[1]:
INFO sparkDriver-akka.actor.default-dispatcher-14 
cluster.SparkDeploySchedulerBackend - Asking each executor to shut down
15:06:53,783 1.1.0.SNAP  INFO Thread-40 scheduler.DAGScheduler - Job 1 failed: 
start at SparkDriver.java:109, took 72.739141 s
Exception in thread "Thread-40" org.apache.spark.SparkException: Job cancelled 
because SparkContext was shut down
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at 
org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:701)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1428)
at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundPostStop(DAGScheduler.scala:1375)
at 
akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
at 
akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
at akka.actor.ActorCell.terminate(ActorCell.scala:369)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

info[2]
INFO LocalActorRef: Message 
[akka.remote.transport.AssociationHandle$Disassociated] from 
Actor[akka://sparkWorker/deadLetters] to 
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40127.0.0.1%3A52219-2#1619424601]
 was not delivered. [1] dead letters encountered. This logging can be turned 
off or adjusted with configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.
14/12/10 15:06:53 ERROR EndpointWriter: AssociationError 
[akka.tcp://sparkWorker@localhost:51262] <- 
[akka.tcp://sparkExecutor@localhost:52217]: Error [Shut down address: 
akka.tcp://sparkExecutor@localhost:52217] [
akka.remote.ShutDownAssociation: Shut down address: 
akka.tcp://sparkExecutor@localhost:52217
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The 
remote system terminated the association because it is shutting down.

Please note that I am running everything on localhost; the same thing works 
fine in "local" mode and the above issue only arises in "cluster" mode. I tried 
changing the hostname to 127.0.0.1 but observed the same behavior.

Any clues on what might be going on here would help a lot.
Thanks!

> Socket Receiver does not stop when streaming context is stopped
> ---
>
> Key: SPARK-2892
> URL: https://issues.apache.org/jira/browse/SPARK-2892
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.2
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Running NetworkWordCount with
> {quote}  
> ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); 
> Thread.sleep(6)
> {quote}
> gives the following error
> {quote}
> 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) 
> in 10047 ms on localhost (1/1)
> 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at 
> ReceiverTracker.scala:275) finished in 10.056 s
> 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
> have all completed, from pool
> 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at 
> ReceiverTracker.scala:275, took 1

[jira] [Created] (SPARK-4822) Use sphinx tags for Python doc annotations

2014-12-10 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-4822:


 Summary: Use sphinx tags for Python doc annotations
 Key: SPARK-4822
 URL: https://issues.apache.org/jira/browse/SPARK-4822
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, PySpark
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor


Currently, pyspark documentation uses the same annotations as in Scala:
{code}
:: Experimental ::
{code}
It should use Sphinx annotations:
{code}
.. note:: Experimental
{code}
The latter method renders as a highlighted box in the generated docs.  The former 
method must either be on the same line as the rest of the doc text, or it 
generates compilation warnings.

Proposal: Change all annotations in the Python docs to use ".. note::".
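
For illustration, here is a minimal sketch (hypothetical class and docstring, not 
taken from the pyspark source) of what the proposed Sphinx annotation looks like 
inside a Python docstring:

{code}
class StreamingLinearModel(object):
    """Fits a linear model to a stream of data.

    .. note:: Experimental

    The directive above renders as a highlighted box in the Sphinx-generated
    API docs, whereas the Scala-style ':: Experimental ::' marker has to share
    a line with other text to avoid build warnings.
    """

    def train(self, data):
        # placeholder body; only the docstring matters for this example
        raise NotImplementedError
{code}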



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3526) Docs section on data locality

2014-12-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3526.

   Resolution: Fixed
Fix Version/s: 1.2.0

Thanks [~aash] for contributing.

> Docs section on data locality
> -
>
> Key: SPARK-3526
> URL: https://issues.apache.org/jira/browse/SPARK-3526
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.0.2
>Reporter: Andrew Ash
>Assignee: Andrew Ash
> Fix For: 1.2.0
>
>
> Several threads on the mailing list have been about data locality and how to 
> interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc.  Let's get some more 
> details in the docs on this concept so we can point future questions there.
> A couple people appreciated the below description of locality so it could be 
> a good starting point:
> {quote}
> The locality is how close the data is to the code that's processing it.  
> PROCESS_LOCAL means data is in the same JVM as the code that's running, so 
> it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the same 
> node, or in another executor on the same node, so is a little slower because 
> the data has to travel across an IPC connection.  RACK_LOCAL is even slower 
> -- data is on a different server so needs to be sent over the network.
> Spark switches to lower locality levels when there's no unprocessed data on a 
> node that has idle CPUs.  In that situation you have two options: wait until 
> the busy CPUs free up so you can start another task that uses data on that 
> server, or start a new task on a farther away server that needs to bring data 
> from that remote place.  What Spark typically does is wait a bit in the hopes 
> that a busy CPU frees up.  Once that timeout expires, it starts moving the 
> data from far away to the free CPU.
> {quote}
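
As a concrete illustration of the wait behavior described above, here is a minimal 
PySpark sketch (illustrative only) of lengthening the time Spark waits for a more 
local slot before falling back; spark.locality.wait is the standard setting, given 
in milliseconds in Spark 1.x with a default of 3000:

{code}
from pyspark import SparkConf, SparkContext

# Wait up to 10 seconds for a more local slot (PROCESS_LOCAL/NODE_LOCAL)
# before launching the task at a less local level.
conf = (SparkConf()
        .setAppName("locality-wait-sketch")
        .set("spark.locality.wait", "10000"))
sc = SparkContext(conf=conf)
{code}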



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4633) Support gzip in spark.compression.io.codec

2014-12-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell closed SPARK-4633.
--
Resolution: Won't Fix

I'd like to close this issue for now until we get a better understanding of the 
needs. I'm not against adding more codecs; it would just be good to have a 
profiled reason for doing so since, as I said, each additional codec has a 
support overhead.

We can re-open this if there is more interest.

> Support gzip in spark.compression.io.codec
> --
>
> Key: SPARK-4633
> URL: https://issues.apache.org/jira/browse/SPARK-4633
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Takeshi Yamamuro
>Priority: Trivial
>
> gzip is widely used in other frameworks such as hadoop mapreduce and tez, and
> I also think that gzip is more stable than other codecs in terms of both
> performance and space overheads.
> I have one open question: the current spark configurations have a block size
> option for each codec (spark.io.compression.[gzip|lz4|snappy].block.size).
> As the number of codecs increases, the configuration gains more options, and
> I think that is somewhat complicated for non-expert users.
> To mitigate this, my thought is as follows:
> the three configurations are replaced with a single option for block size
> (spark.io.compression.block.size). Then, 'Meaning' in the configuration docs
> would describe "This option has an effect on gzip, lz4, and snappy. 
> Block size (in bytes) used in compression, in the case when these compression
> codecs are used. Lowering...".
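
To make the proposal concrete, here is a minimal PySpark sketch (using the option 
names exactly as written in the description above; note that 
spark.io.compression.block.size is the proposed key, not an existing one):

{code}
from pyspark import SparkConf

# Current style: one block-size option per codec.
current = (SparkConf()
           .set("spark.io.compression.codec", "snappy")
           .set("spark.io.compression.snappy.block.size", "32768"))

# Proposed style: a single block-size option shared by all codecs
# (spark.io.compression.block.size is the proposed, not an existing, key).
proposed = (SparkConf()
            .set("spark.io.compression.codec", "lz4")
            .set("spark.io.compression.block.size", "32768"))
{code}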



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4821) pyspark.mllib.rand docs not generated correctly

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241871#comment-14241871
 ] 

Apache Spark commented on SPARK-4821:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/3669

> pyspark.mllib.rand docs not generated correctly
> ---
>
> Key: SPARK-4821
> URL: https://issues.apache.org/jira/browse/SPARK-4821
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, MLlib, PySpark
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> spark/python/docs/pyspark.mllib.rst needs to be updated to reflect the change 
> in package names from pyspark.mllib.random to .rand
> Otherwise, the Python API docs are empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4759) Deadlock in complex spark job in local mode

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4759:
-
Fix Version/s: 1.1.2
   1.3.0

> Deadlock in complex spark job in local mode
> ---
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0, 1.3.0
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.3.0, 1.1.2
>
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   # taking new data and merging it with the previous result
>   # caching and checkpointing the new result
>   # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  
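
For readers unfamiliar with the pattern, here is a minimal PySpark sketch 
(hypothetical data and names, not the attached SparkBugReplicator.scala) of the 
merge/cache/checkpoint loop described in the report:

{code}
from pyspark import SparkContext

sc = SparkContext("local[*]", "merge-cache-checkpoint-sketch")
sc.setCheckpointDir("/tmp/spark-checkpoints")

result = sc.parallelize([(0, 0)])              # running result: RDD of (Int, Int)
for step in range(1, 4):
    new_data = sc.parallelize([(step % 2, step)])
    # 1. merge the new data with the previous result
    result = result.union(new_data).reduceByKey(lambda a, b: a + b)
    # 2. cache and checkpoint the new result
    result.cache()
    result.checkpoint()
    result.count()                             # force materialization
    # 3. rinse and repeat
sc.stop()
{code}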



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4759) Deadlock in complex spark job in local mode

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4759:
-
Labels: backport-needed  (was: )

> Deadlock in complex spark job in local mode
> ---
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0, 1.3.0
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
>Assignee: Andrew Or
>Priority: Critical
>  Labels: backport-needed
> Fix For: 1.3.0, 1.1.2
>
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   # taking new data and merging it with the previous result
>   # caching and checkpointing the new result
>   # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4759) Deadlock in complex spark job in local mode

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4759:
-
Target Version/s: 1.3.0, 1.1.2, 1.2.1

> Deadlock in complex spark job in local mode
> ---
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0, 1.3.0
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
>Assignee: Andrew Or
>Priority: Critical
>  Labels: backport-needed
> Fix For: 1.3.0, 1.1.2
>
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   # taking new data and merging it with the previous result
>   # caching and checkpointing the new result
>   # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4821) pyspark.mllib.rand docs not generated correctly

2014-12-10 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-4821:


 Summary: pyspark.mllib.rand docs not generated correctly
 Key: SPARK-4821
 URL: https://issues.apache.org/jira/browse/SPARK-4821
 Project: Spark
  Issue Type: Bug
  Components: Documentation, MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


spark/python/docs/pyspark.mllib.rst needs to be updated to reflect the change 
in package names from pyspark.mllib.random to .rand

Otherwise, the Python API docs are empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4569) Rename "externalSorting" in Aggregator

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4569:
-
Labels: backport-needed  (was: )

> Rename "externalSorting" in Aggregator
> --
>
> Key: SPARK-4569
> URL: https://issues.apache.org/jira/browse/SPARK-4569
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>Priority: Trivial
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> While technically all spilling in Spark does result in sorting, calling this 
> variable externalSorting makes it seem like ExternalSorter will be used, when 
> in fact it just means whether spilling is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4569) Rename "externalSorting" in Aggregator

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4569:
-
Target Version/s: 1.3.0, 1.1.2, 1.2.1
   Fix Version/s: 1.3.0

> Rename "externalSorting" in Aggregator
> --
>
> Key: SPARK-4569
> URL: https://issues.apache.org/jira/browse/SPARK-4569
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>Priority: Trivial
> Fix For: 1.3.0
>
>
> While technically all spilling in Spark does result in sorting, calling this 
> variable externalSorting makes it seem like ExternalSorter will be used, when 
> in fact it just means whether spilling is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4820) Spark build encounters "File name too long" on some encrypted filesystems

2014-12-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4820:
---
Description: 
This was reported by Luchesar Cekov on github along with a proposed fix. The 
fix has some potential downstream issues (it will modify the classnames) so 
until we understand better how many users are affected we aren't going to merge 
it. However, I'd like to include the issue and workaround here. If you 
encounter this issue please comment on the JIRA so we can assess the frequency.

The issue produces this error:
{code}
[error] == Expanded type of tree ==
[error] 
[error] ConstantType(value = Constant(Throwable))
[error] 
[error] uncaught exception during compilation: java.io.IOException
[error] File name too long
[error] two errors found
{code}

The workaround in Maven is to add the following under the compile options: 

{code}
+  <arg>-Xmax-classfile-name</arg>
+  <arg>128</arg>
{code}

In SBT add:

{code}
+scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
{code}


  was:
This was reported by Luchesar Cekov on github along with a proposed fix. The 
fix has some potential downstream issues (it will modify the classnames) so 
until we understand better how many users are affected we aren't going to merge 
it. However, I'd like to include the issue and workaround here.

The issue produces this error:
{code}
[error] == Expanded type of tree ==
[error] 
[error] ConstantType(value = Constant(Throwable))
[error] 
[error] uncaught exception during compilation: java.io.IOException
[error] File name too long
[error] two errors found
{code}

The workaround in Maven is to add the following under the compile options: 

{code}
+  <arg>-Xmax-classfile-name</arg>
+  <arg>128</arg>
{code}

In SBT add:

{code}
+scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
{code}



> Spark build encounters "File name too long" on some encrypted filesystems
> -
>
> Key: SPARK-4820
> URL: https://issues.apache.org/jira/browse/SPARK-4820
> Project: Spark
>  Issue Type: Bug
>Reporter: Patrick Wendell
>
> This was reported by Luchesar Cekov on github along with a proposed fix. The 
> fix has some potential downstream issues (it will modify the classnames) so 
> until we understand better how many users are affected we aren't going to 
> merge it. However, I'd like to include the issue and workaround here. If you 
> encounter this issue please comment on the JIRA so we can assess the 
> frequency.
> The issue produces this error:
> {code}
> [error] == Expanded type of tree ==
> [error] 
> [error] ConstantType(value = Constant(Throwable))
> [error] 
> [error] uncaught exception during compilation: java.io.IOException
> [error] File name too long
> [error] two errors found
> {code}
> The workaround in Maven is to add the following under the compile options: 
> {code}
> +  <arg>-Xmax-classfile-name</arg>
> +  <arg>128</arg>
> {code}
> In SBT add:
> {code}
> +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4820) Spark build encounters "File name too long" on some encrypted filesystems

2014-12-10 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-4820:
--

 Summary: Spark build encounters "File name too long" on some 
encrypted filesystems
 Key: SPARK-4820
 URL: https://issues.apache.org/jira/browse/SPARK-4820
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell


This was reported by Luchesar Cekov on github along with a proposed fix. The 
fix has some potential downstream issues (it will modify the classnames) so 
until we understand better how many users are affected we aren't going to merge 
it. However, I'd like to include the issue and workaround here.

The issue produces this error:
{code}
[error] == Expanded type of tree ==
[error] 
[error] ConstantType(value = Constant(Throwable))
[error] 
[error] uncaught exception during compilation: java.io.IOException
[error] File name too long
[error] two errors found
{code}

The workaround in Maven is to add the following under the compile options: 

{code}
+  <arg>-Xmax-classfile-name</arg>
+  <arg>128</arg>
{code}

In SBT add:

{code}
+scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4819) Remove Guava's "Optional" from public API

2014-12-10 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-4819:
-

 Summary: Remove Guava's "Optional" from public API
 Key: SPARK-4819
 URL: https://issues.apache.org/jira/browse/SPARK-4819
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Marcelo Vanzin


Filing this mostly so this isn't forgotten. Spark currently exposes Guava types 
in its public API (the {{Optional}} class is used in the Java bindings). This 
makes it hard to properly hide Guava from user applications, and makes mixing 
different Guava versions with Spark a little sketchy (even if things should 
work, since those classes are pretty simple in general).

Since this changes the public API, it has to be done in a release that allows 
such breakages. But it would be nice to at least have a transition plan for 
deprecating the affected APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4793) way to find assembly jar is too strict

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4793:
-
Target Version/s: 1.3.0, 1.1.2, 1.2.1
   Fix Version/s: 1.3.0

> way to find assembly jar is too strict
> --
>
> Key: SPARK-4793
> URL: https://issues.apache.org/jira/browse/SPARK-4793
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.1.0
>Reporter: Adrian Wang
>Assignee: Adrian Wang
>Priority: Minor
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4793) way to find assembly jar is too strict

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4793:
-
Affects Version/s: 1.1.0

> way to find assembly jar is too strict
> --
>
> Key: SPARK-4793
> URL: https://issues.apache.org/jira/browse/SPARK-4793
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.1.0
>Reporter: Adrian Wang
>Assignee: Adrian Wang
>Priority: Minor
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4793) way to find assembly jar is too strict

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4793:
-
Assignee: Adrian Wang

> way to find assembly jar is too strict
> --
>
> Key: SPARK-4793
> URL: https://issues.apache.org/jira/browse/SPARK-4793
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Adrian Wang
>Assignee: Adrian Wang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4215) Allow requesting executors only on Yarn (for now)

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4215:
-
Fix Version/s: 1.3.0

> Allow requesting executors only on Yarn (for now)
> -
>
> Key: SPARK-4215
> URL: https://issues.apache.org/jira/browse/SPARK-4215
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> Currently if the user attempts to call `sc.requestExecutors` or enables 
> dynamic allocation on, say, standalone mode, it just fails silently. We must 
> at the very least log a warning to say it's only available for Yarn 
> currently, or maybe even throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4215) Allow requesting executors only on Yarn (for now)

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4215:
-
Labels: backport-needed  (was: )

> Allow requesting executors only on Yarn (for now)
> -
>
> Key: SPARK-4215
> URL: https://issues.apache.org/jira/browse/SPARK-4215
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> Currently if the user attempts to call `sc.requestExecutors` or enables 
> dynamic allocation on, say, standalone mode, it just fails silently. We must 
> at the very least log a warning to say it's only available for Yarn 
> currently, or maybe even throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-12-10 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241701#comment-14241701
 ] 

Pat Ferrel commented on SPARK-2075:
---

If the explanation is correct, this needs to be filed against Spark as publishing 
the wrong, or not enough, artifacts to the Maven repos. There would need to be a 
different artifact for every config option that changes internal naming.

I can't understand why lots of people aren't running into this; all it requires 
is that you link against the repo artifact and run against a user-compiled 
Spark.

> Anonymous classes are missing from Spark distribution
> -
>
> Key: SPARK-2075
> URL: https://issues.apache.org/jira/browse/SPARK-2075
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 1.0.0
>Reporter: Paul R. Brown
>Priority: Critical
>
> Running a job built against the Maven dep for 1.0.0 and the hadoop1 
> distribution produces:
> {code}
> java.lang.ClassNotFoundException:
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
> {code}
> Here's what's in the Maven dep as of 1.0.0:
> {code}
> jar tvf 
> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
>  | grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}
> And here's what's in the hadoop1 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
> {code}
> I.e., it's not there.  It is in the hadoop2 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4771) Document standalone --supervise feature

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4771.

   Resolution: Fixed
Fix Version/s: 1.2.1
   1.1.2

> Document standalone --supervise feature
> ---
>
> Key: SPARK-4771
> URL: https://issues.apache.org/jira/browse/SPARK-4771
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.3.0, 1.1.2, 1.2.1
>
>
> We need this especially for streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4771) Document standalone --supervise feature

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4771:
-
Fix Version/s: 1.3.0

> Document standalone --supervise feature
> ---
>
> Key: SPARK-4771
> URL: https://issues.apache.org/jira/browse/SPARK-4771
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.3.0
>
>
> We need this especially for streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4771) Document standalone --supervise feature

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4771:
-
Target Version/s: 1.3.0, 1.1.2, 1.2.1  (was: 1.1.2, 1.2.1)

> Document standalone --supervise feature
> ---
>
> Key: SPARK-4771
> URL: https://issues.apache.org/jira/browse/SPARK-4771
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.3.0
>
>
> We need this especially for streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4329) Add indexing feature for HistoryPage

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4329.

Resolution: Fixed
  Assignee: Kousuke Saruta

> Add indexing feature for HistoryPage
> 
>
> Key: SPARK-4329
> URL: https://issues.apache.org/jira/browse/SPARK-4329
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
> Fix For: 1.3.0
>
>
> The current HistoryPage has links only to the previous page or the next page.
> I suggest adding an index so that history pages can be accessed easily.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4329) Add indexing feature for HistoryPage

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4329:
-
Fix Version/s: 1.3.0

> Add indexing feature for HistoryPage
> 
>
> Key: SPARK-4329
> URL: https://issues.apache.org/jira/browse/SPARK-4329
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
> Fix For: 1.3.0
>
>
> The current HistoryPage has links only to the previous page or the next page.
> I suggest adding an index so that history pages can be accessed easily.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4161) Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4161:
-
Labels: backport-needed  (was: )

> Spark shell class path is not correctly set if "spark.driver.extraClassPath" 
> is set in defaults.conf
> 
>
> Key: SPARK-4161
> URL: https://issues.apache.org/jira/browse/SPARK-4161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.1.0, 1.2.0
> Environment: Mac, Ubuntu
>Reporter: Shay Seng
>Assignee: Guoqiang Li
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> (1) I want to launch a spark-shell with jars that are only required by the 
> driver (i.e. not shipped to slaves)
>  
> (2) I added "spark.driver.extraClassPath  /mypath/to.jar" to my 
> spark-defaults.conf
> I launched spark-shell with:  ./spark-shell
> Here I see on the WebUI that spark.driver.extraClassPath has been set, but I 
> am NOT able to access any methods in the jar.
> (3) I removed "spark.driver.extraClassPath" from my spark-default.conf
> I launched spark-shell with  ./spark-shell --driver.class.path /mypath/to.jar
> Again I see that the WebUI spark.driver.extraClassPath has been set. 
> But this time I am able to access the methods in the jar. 
> Looks like when the driver class path is loaded from spark-default.conf, the 
> REPL's classpath is not correctly appended.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4161) Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4161:
-
Fix Version/s: 1.3.0

> Spark shell class path is not correctly set if "spark.driver.extraClassPath" 
> is set in defaults.conf
> 
>
> Key: SPARK-4161
> URL: https://issues.apache.org/jira/browse/SPARK-4161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.1.0, 1.2.0
> Environment: Mac, Ubuntu
>Reporter: Shay Seng
>Assignee: Guoqiang Li
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> (1) I want to launch a spark-shell with jars that are only required by the 
> driver (i.e. not shipped to slaves)
>  
> (2) I added "spark.driver.extraClassPath  /mypath/to.jar" to my 
> spark-defaults.conf
> I launched spark-shell with:  ./spark-shell
> Here I see on the WebUI that spark.driver.extraClassPath has been set, but I 
> am NOT able to access any methods in the jar.
> (3) I removed "spark.driver.extraClassPath" from my spark-default.conf
> I launched spark-shell with  ./spark-shell --driver.class.path /mypath/to.jar
> Again I see that the WebUI spark.driver.extraClassPath has been set. 
> But this time I am able to access the methods in the jar. 
> Looks like when the driver class path is loaded from spark-default.conf, the 
> REPL's classpath is not correctly appended.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4161) Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4161:
-
Target Version/s: 1.3.0, 1.1.2, 1.2.1  (was: 1.1.2, 1.2.1)

> Spark shell class path is not correctly set if "spark.driver.extraClassPath" 
> is set in defaults.conf
> 
>
> Key: SPARK-4161
> URL: https://issues.apache.org/jira/browse/SPARK-4161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.1.0, 1.2.0
> Environment: Mac, Ubuntu
>Reporter: Shay Seng
>Assignee: Guoqiang Li
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> (1) I want to launch a spark-shell with jars that are only required by the 
> driver (i.e. not shipped to slaves)
>  
> (2) I added "spark.driver.extraClassPath  /mypath/to.jar" to my 
> spark-defaults.conf
> I launched spark-shell with:  ./spark-shell
> Here I see on the WebUI that spark.driver.extraClassPath has been set, but I 
> am NOT able to access any methods in the jar.
> (3) I removed "spark.driver.extraClassPath" from my spark-default.conf
> I launched spark-shell with  ./spark-shell --driver.class.path /mypath/to.jar
> Again I see that the WebUI spark.driver.extraClassPath has been set. 
> But this time I am able to access the methods in the jar. 
> Looks like when the driver class path is loaded from spark-default.conf, the 
> REPL's classpath is not correctly appended.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2951) SerDeUtils.pythonToPairRDD fails on RDDs of pickled array.arrays in Python 2.6

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241654#comment-14241654
 ] 

Apache Spark commented on SPARK-2951:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3668

> SerDeUtils.pythonToPairRDD fails on RDDs of pickled array.arrays in Python 2.6
> --
>
> Key: SPARK-2951
> URL: https://issues.apache.org/jira/browse/SPARK-2951
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
>Reporter: Josh Rosen
>Assignee: Davies Liu
> Fix For: 1.2.0
>
>
> With Python 2.6, calling SerDeUtils.pythonToPairRDD() on an RDD of pickled 
> Python array.arrays will fail with this exception:
> {code}
> java.lang.ClassCastException: java.lang.String cannot be cast to 
> java.util.ArrayList
> 
> net.razorvine.pickle.objects.ArrayConstructor.construct(ArrayConstructor.java:33)
> net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
> net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
> net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
> net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
> 
> org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToPairRDD$1$$anonfun$5.apply(SerDeUtil.scala:106)
> 
> org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToPairRDD$1$$anonfun$5.apply(SerDeUtil.scala:106)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:898)
> 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:880)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}
> I think this is due to a difference in how array.array is pickled in Python 
> 2.6 vs. Python 2.7.  To see this, run the following script:
> {code}
> from pickletools import dis, optimize
> from pickle import dumps, loads, HIGHEST_PROTOCOL
> from array import array
> arr = array('d', [1, 2, 3])
> #protocol = HIGHEST_PROTOCOL
> protocol = 0
> pickled = dumps(arr, protocol=protocol)
> pickled = optimize(pickled)
> unpickled = loads(pickled)
> print arr
> print unpickled
> print dis(pickled)
> {code}
> In Python 2.7, this outputs
> {code}
> array('d', [1.0, 2.0, 3.0])
> array('d', [1.0, 2.0, 3.0])
> 0: cGLOBAL 'array array'
>13: (MARK
>14: SSTRING 'd'
>19: (MARK
>20: lLIST   (MARK at 19)
>21: FFLOAT  1.0
>26: aAPPEND
>27: FFLOAT  2.0
>32: aAPPEND
>33: FFLOAT  3.0
>38: aAPPEND
>39: tTUPLE  (MARK at 13)
>40: RREDUCE
>41: .STOP
> highest protocol among opcodes = 0
> None
> {code}
> whereas 2.6 outputs
> {code}
> array('d', [1.0, 2.0, 3.0])
> array('d', [1.0, 2.0, 3.0])
> 0: cGLOBAL 'array array'
>13: (MARK
>14: SSTRING 'd'
>19: SSTRING 
> '\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@'
>   110: tTUPLE  (MARK at 13)
>   111: RREDUCE
>   112: .STOP
> highest protocol among opcodes = 0
> None
> {code}
> I think the Java-side depickling library doesn't expect this pickled format, 
> causing this failure.
> I noticed this when running PySpark's unit tests on 2.6 because the 
> TestOuputFormat.test_newhadoop test failed.
> I think that this issue affects all of the methods that might need to 
> depickle arrays in Java, including all of the Hadoop output format methods.
> How should we try to fix this?  Require that users upgrade to 2.7 if they 
> want to use code that requires this?  Open a bug with the depickling library 
> maintainers?  Try to hack in our own pickling routines for arrays if we 
> detect that we're using 2.6?
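
One way to make the last option above concrete: a minimal sketch (hypothetical 
helper, not PySpark code) that registers a custom reducer so array.array always 
pickles via its list of values, matching the 2.7-style output shown above:

{code}
import copy_reg            # 'copyreg' on Python 3
from array import array

def _reduce_array(a):
    # Pickle an array as (constructor, (typecode, list-of-values)),
    # which both Python 2.6/2.7 and the Java-side unpickler understand.
    return (array, (a.typecode, list(a)))

copy_reg.pickle(array, _reduce_array)
{code}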



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3918) Forget Unpersist in RandomForest.scala(train Method)

2014-12-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241649#comment-14241649
 ] 

Joseph K. Bradley commented on SPARK-3918:
--

Oops!  I forgot to update that PR's name.  It was originally in that PR, but 
[~Junlong Liu] sent a PR with the change first:
[https://github.com/apache/spark/commit/942847fd94c920f7954ddf01f97263926e512b0e]

(The PR linked above was not tagged with this JIRA.)

> Forget Unpersist in RandomForest.scala(train Method)
> 
>
> Key: SPARK-3918
> URL: https://issues.apache.org/jira/browse/SPARK-3918
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: All
>Reporter: junlong
>Assignee: Joseph K. Bradley
>  Labels: decisiontree, train, unpersist
> Fix For: 1.1.0
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> In the version 1.1.0 DecisionTree.scala train method, treeInput is persisted 
> in memory but never unpersisted, which causes heavy disk usage.
> In the github version (1.2.0, maybe), the RandomForest.scala train method 
> likewise persists baggedInput without unpersisting it.
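
The fix being asked for is the usual persist/unpersist pairing; here is a minimal 
PySpark sketch (hypothetical RDD and workload, not the MLlib code itself):

{code}
from pyspark import SparkContext
from pyspark.storagelevel import StorageLevel

sc = SparkContext("local[*]", "persist-unpersist-sketch")
tree_input = sc.parallelize(range(1000)).map(lambda x: (x % 10, x))

tree_input.persist(StorageLevel.MEMORY_AND_DISK)
try:
    # training-style work that reuses the cached RDD several times
    for _ in range(3):
        tree_input.countByKey()
finally:
    tree_input.unpersist()   # release the cached blocks once training is done

sc.stop()
{code}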



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3702) Standardize MLlib classes for learners, models

2014-12-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3702:
-
Description: 
Summary: Create a class hierarchy for learning algorithms and the models those 
algorithms produce.

This is a super-task of several sub-tasks (but JIRA does not allow subtasks of 
subtasks).  See the "requires" links below for subtasks.

Goals:
* give intuitive structure to API, both for developers and for generated 
documentation
* support meta-algorithms (e.g., boosting)
* support generic functionality (e.g., evaluation)
* reduce code duplication across classes

[Design doc for class hierarchy | 
https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]

  was:
Summary: Create a class hierarchy for learning algorithms and the models those 
algorithms produce.

This is a super-task of several sub-tasks (but JIRA does not allow subtasks of 
subtasks).  See the "depends on" links below for subtasks.

Goals:
* give intuitive structure to API, both for developers and for generated 
documentation
* support meta-algorithms (e.g., boosting)
* support generic functionality (e.g., evaluation)
* reduce code duplication across classes

[Design doc for class hierarchy | 
https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]


> Standardize MLlib classes for learners, models
> --
>
> Key: SPARK-3702
> URL: https://issues.apache.org/jira/browse/SPARK-3702
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models 
> those algorithms produce.
> This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
> of subtasks).  See the "requires" links below for subtasks.
> Goals:
> * give intuitive structure to API, both for developers and for generated 
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy | 
> https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models

2014-12-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241611#comment-14241611
 ] 

Joseph K. Bradley commented on SPARK-3702:
--

APIs for Classifiers, Regressors

> Standardize MLlib classes for learners, models
> --
>
> Key: SPARK-3702
> URL: https://issues.apache.org/jira/browse/SPARK-3702
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models 
> those algorithms produce.
> This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
> of subtasks).  See the "depends on" links below for subtasks.
> Goals:
> * give intuitive structure to API, both for developers and for generated 
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy | 
> https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3702) Standardize MLlib classes for learners, models

2014-12-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3702:
-
Description: 
Summary: Create a class hierarchy for learning algorithms and the models those 
algorithms produce.

This is a super-task of several sub-tasks (but JIRA does not allow subtasks of 
subtasks).  See the "depends on" links below for subtasks.

Goals:
* give intuitive structure to API, both for developers and for generated 
documentation
* support meta-algorithms (e.g., boosting)
* support generic functionality (e.g., evaluation)
* reduce code duplication across classes

[Design doc for class hierarchy | 
https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]

  was:
Summary: Create a class hierarchy for learning algorithms and the models those 
algorithms produce.

Goals:
* give intuitive structure to API, both for developers and for generated 
documentation
* support meta-algorithms (e.g., boosting)
* support generic functionality (e.g., evaluation)
* reduce code duplication across classes

[Design doc for class hierarchy | 
https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]


> Standardize MLlib classes for learners, models
> --
>
> Key: SPARK-3702
> URL: https://issues.apache.org/jira/browse/SPARK-3702
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models 
> those algorithms produce.
> This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
> of subtasks).  See the "depends on" links below for subtasks.
> Goals:
> * give intuitive structure to API, both for developers and for generated 
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy | 
> https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4789) Standardize ML Prediction APIs

2014-12-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-4789:


Assignee: Joseph K. Bradley

> Standardize ML Prediction APIs
> --
>
> Key: SPARK-4789
> URL: https://issues.apache.org/jira/browse/SPARK-4789
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> Create a standard set of abstractions for prediction in spark.ml.  This will 
> follow the design doc specified in [SPARK-3702].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4789) Standardize ML Prediction APIs

2014-12-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4789:
-
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-1856

> Standardize ML Prediction APIs
> --
>
> Key: SPARK-4789
> URL: https://issues.apache.org/jira/browse/SPARK-4789
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Create a standard set of abstractions for prediction in spark.ml.  This will 
> follow the design doc specified in [SPARK-3702].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel

2014-12-10 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241535#comment-14241535
 ] 

Debasish Das commented on SPARK-4675:
-

There are a few issues:

1. Batch API for topK similar users and topK similar products
2. Comparison of product x product similarities generated with 
columnSimilarities against the topK similar products

I added batch APIs for topK product recommendation for each user and topK user 
recommendation for each product in SPARK-4231...similar batch API will be very 
helpful for topK similar users and topK similar products...

I agree with Cosine Similarity...you should be able to re-use column similarity 
calculations...I think a better idea is to add rowMatrix.similarRows and re-use 
that code to generate product similarities and user similarities...

But my question is more about validation. We can compute product similarities on 
the raw features, or on the product factors of the matrix 
factorization...which one is better?
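
For context, here is a minimal numpy sketch (hypothetical helper, not the proposed 
MatrixFactorizationModel API) of ranking the top-k most similar products by cosine 
similarity over latent product factors:

{code}
import numpy as np

def top_k_similar(factors, query_idx, k=5):
    """Indices of the k rows most cosine-similar to factors[query_idx]."""
    norms = np.linalg.norm(factors, axis=1)
    sims = factors.dot(factors[query_idx]) / (norms * norms[query_idx] + 1e-12)
    sims[query_idx] = -np.inf          # exclude the query item itself
    return np.argsort(sims)[::-1][:k]

# 'factors' would come from the model's latent product (or user) factors;
# here a random stand-in is used.
product_factors = np.random.rand(100, 10)
print(top_k_similar(product_factors, query_idx=0))
{code}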

> Find similar products and similar users in MatrixFactorizationModel
> ---
>
> Key: SPARK-4675
> URL: https://issues.apache.org/jira/browse/SPARK-4675
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Steven Bourke
>Priority: Trivial
>  Labels: mllib, recommender
>
> Using the latent feature space that is learnt in MatrixFactorizationModel, I 
> have added 2 new functions to find similar products and similar users. A user 
> of the API can for example pass a product ID, and get the closest products. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241524#comment-14241524
 ] 

Apache Spark commented on SPARK-4740:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/3667

> Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
> 
>
> Key: SPARK-4740
> URL: https://issues.apache.org/jira/browse/SPARK-4740
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: Zhang, Liye
>Assignee: Reynold Xin
> Attachments: (rxin patch better executor)TestRunner  sort-by-key - 
> Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner  
> sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 
> 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner  
> sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, 
> TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per 
> node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z
>
>
> When testing current spark master (1.3.0-snapshot) with spark-perf 
> (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService 
> takes much longer time than NIO based shuffle transferService. The network 
> throughput of Netty is only about half of that of NIO. 
> We tested with standalone mode, and the data set we used for test is 20 
> billion records, and the total size is about 400GB. Spark-perf test is 
> running on a 4-node cluster with 10G NIC, 48 cpu cores per node, and each 
> executor memory is 64GB. The reduce tasks number is set to 1000. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey

2014-12-10 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241523#comment-14241523
 ] 

Reynold Xin commented on SPARK-4740:


Also, [~jerryshao], when I asked you to disable transferTo, the code you pasted 
still returns a ByteBuffer, which wouldn't work in Netty. Was the code you 
pasted here different from what was compiled?


Can you try this patch, which disables transferTo? 
https://github.com/apache/spark/pull/3667


> Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
> 
>
> Key: SPARK-4740
> URL: https://issues.apache.org/jira/browse/SPARK-4740
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: Zhang, Liye
>Assignee: Reynold Xin
> Attachments: (rxin patch better executor)TestRunner  sort-by-key - 
> Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner  
> sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 
> 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner  
> sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, 
> TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per 
> node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z
>
>
> When testing the current Spark master (1.3.0-snapshot) with spark-perf 
> (sort-by-key, aggregate-by-key, etc.), the Netty-based shuffle transferService 
> takes much longer than the NIO-based one; Netty's network throughput is only 
> about half of NIO's. 
> We tested in standalone mode with a data set of 20 billion records totaling 
> about 400GB. The spark-perf test runs on a 4-node cluster with 10G NICs, 
> 48 CPU cores per node, and 64GB of memory per executor. The number of reduce 
> tasks is set to 1000. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey

2014-12-10 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241463#comment-14241463
 ] 

Aaron Davidson commented on SPARK-4740:
---

Clarification: The merged version of Reynold's patch has connectionsPerPeer set 
to 1, since we could not demonstrate a significant improvement with other 
values. In your test with HDDs, did you have it set, or was it using the 
default value?
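
For reproducing the comparison, a minimal sketch of pinning these settings 
explicitly rather than relying on defaults; the config keys 
spark.shuffle.blockTransferService and spark.shuffle.io.numConnectionsPerPeer 
are assumed to be the ones read by the Netty transport in the version under 
test, so worth verifying first.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: make the transport and connections-per-peer explicit so a
// benchmark run is not silently using the defaults.
val conf = new SparkConf()
  .setAppName("spark-perf-sort-by-key")
  .set("spark.shuffle.blockTransferService", "netty")
  .set("spark.shuffle.io.numConnectionsPerPeer", "1")
val sc = new SparkContext(conf)
{code}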

> Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
> 
>
> Key: SPARK-4740
> URL: https://issues.apache.org/jira/browse/SPARK-4740
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: Zhang, Liye
>Assignee: Reynold Xin
> Attachments: (rxin patch better executor)TestRunner  sort-by-key - 
> Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner  
> sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 
> 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner  
> sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, 
> TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per 
> node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z
>
>
> When testing the current Spark master (1.3.0-snapshot) with spark-perf 
> (sort-by-key, aggregate-by-key, etc.), the Netty-based shuffle transferService 
> takes much longer than the NIO-based one; Netty's network throughput is only 
> about half of NIO's. 
> We tested in standalone mode with a data set of 20 billion records totaling 
> about 400GB. The spark-perf test runs on a 4-node cluster with 10G NICs, 
> 48 CPU cores per node, and 64GB of memory per executor. The number of reduce 
> tasks is set to 1000. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4569) Rename "externalSorting" in Aggregator

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241455#comment-14241455
 ] 

Apache Spark commented on SPARK-4569:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/3666
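
A hedged sketch of what such a rename could look like; the name isSpillEnabled 
is a suggestion for illustration, not necessarily the merged change.

{code}
import org.apache.spark.SparkConf

// The flag only controls whether combiners may spill to disk, so a name tied
// to spilling is less misleading than "externalSorting".
val conf = new SparkConf()
val isSpillEnabled = conf.getBoolean("spark.shuffle.spill", true)
// Aggregator would then branch on isSpillEnabled instead of externalSorting.
{code}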

> Rename "externalSorting" in Aggregator
> --
>
> Key: SPARK-4569
> URL: https://issues.apache.org/jira/browse/SPARK-4569
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>Priority: Trivial
>
> While technically all spilling in Spark does result in sorting, calling this 
> variable externalSorting makes it seem like ExternalSorter will be used, when 
> in fact it only indicates whether spilling is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1037) the name of findTaskFromList & findTask in TaskSetManager.scala is confusing

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241442#comment-14241442
 ] 

Apache Spark commented on SPARK-1037:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/3665

> the name of findTaskFromList & findTask in TaskSetManager.scala is confusing
> 
>
> Key: SPARK-1037
> URL: https://issues.apache.org/jira/browse/SPARK-1037
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 0.9.1, 1.0.0
>Reporter: Nan Zhu
>Priority: Minor
>  Labels: starter
>
> The names of these two functions are confusing. 
> Although the comments say that the method does "dequeue" tasks from the list, 
> the names do not make it explicit that the method mutates its parameter, 
> as in: 
> private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
>   while (!list.isEmpty) {
>     val index = list.last
>     list.trimEnd(1)
>     if (copiesRunning(index) == 0 && !successful(index)) {
>       return Some(index)
>     }
>   }
>   None
> }
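
A hedged sketch of a more explicit alternative; the name dequeueTaskFromList 
and the standalone signature below are illustrative only, not necessarily what 
a patch would use.

{code}
import scala.collection.mutable.ArrayBuffer

/** Dequeues task indexes from the end of `list` until a runnable one is found.
  * Note: mutates `list` as a side effect, which the name now makes explicit.
  * copiesRunning and successful stand in for TaskSetManager's internal state. */
def dequeueTaskFromList(list: ArrayBuffer[Int],
                        copiesRunning: Array[Int],
                        successful: Array[Boolean]): Option[Int] = {
  while (list.nonEmpty) {
    val index = list.last
    list.trimEnd(1)  // removes the last element from the caller's buffer
    if (copiesRunning(index) == 0 && !successful(index)) {
      return Some(index)
    }
  }
  None
}
{code}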



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3607) ConnectionManager threads.max configs on the thread pools don't work

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241431#comment-14241431
 ] 

Apache Spark commented on SPARK-3607:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/3664

> ConnectionManager threads.max configs on the thread pools don't work
> 
>
> Key: SPARK-3607
> URL: https://issues.apache.org/jira/browse/SPARK-3607
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Priority: Minor
>
> In the ConnectionManager we have a bunch of thread pools. Each has a setting 
> for its maximum number of threads (like 
> spark.core.connection.handler.threads.max). 
> Those configs don't work because the pools use an unbounded queue. From the 
> ThreadPoolExecutor javadoc: no more than corePoolSize threads will ever be 
> created (and the value of maximumPoolSize therefore doesn't have any effect).
> Luckily this doesn't matter too much, as you can work around it by just 
> increasing the minimum, e.g. spark.core.connection.handler.threads.min. 
> These configs aren't documented either, so they are more of an internal detail 
> for someone reading the code.
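
A short, self-contained illustration of the javadoc behavior described above 
(plain JDK, no Spark code):

{code}
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

val core = 4
val max  = 32

// Unbounded queue: the pool never grows past corePoolSize, so `max` (the
// ".threads.max"-style setting) has no effect; extra tasks just queue up.
val unbounded = new ThreadPoolExecutor(
  core, max, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue[Runnable]())

// Bounded queue: once 1024 queued tasks pile up, the pool grows toward `max`
// before rejecting work, so the maximum actually matters.
val bounded = new ThreadPoolExecutor(
  core, max, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue[Runnable](1024))
{code}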



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4746) integration tests should be separated from faster unit tests

2014-12-10 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241383#comment-14241383
 ] 

Imran Rashid commented on SPARK-4746:
-

I think this can be done in two ways: (1) by moving all the integration tests 
into a separate folder (e.g. core/src/integ-test).  I don't actually know how to 
make that work in sbt & maven, but I'm hoping it won't be too complicated.

Or (2) we use scalatest's test tags for integration tests, so they can be 
skipped.

http://scalatest.org/user_guide/tagging_your_tests

Test tags have the advantage that you can have multiple tags, group them, etc., 
so you get more flexibility in what you choose to run.  This could be useful 
later, e.g. we might want to tag performance tests separately from correctness 
tests.  We don't need to do all that now, but this would at least open the door 
for it.

I have some experience with using test tags.  If we want to use that approach, I 
can assign this to myself and work on a PR.
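
A hedged sketch of option (2) with plain scalatest; the tag and suite names 
below are illustrative.

{code}
import org.scalatest.{FunSuite, Tag}

// Shared tag object; the string is the fully-qualified name used on the
// command line to include or exclude these tests.
object IntegrationTest extends Tag("org.apache.spark.tags.IntegrationTest")

class ExampleSuite extends FunSuite {
  test("end-to-end job over a large dataset", IntegrationTest) {
    // long-running assertions would go here
  }

  test("fast unit-level check") {
    // always run
  }
}
{code}

Tagged tests could then be excluded locally with scalatest's exclude flag, e.g. 
test-only org.apache.spark.ExampleSuite -- -l org.apache.spark.tags.IntegrationTest 
in sbt (flag names per the scalatest runner docs, worth double-checking against 
the version in the build).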

> integration tests should be separated from faster unit tests
> 
>
> Key: SPARK-4746
> URL: https://issues.apache.org/jira/browse/SPARK-4746
> Project: Spark
>  Issue Type: Bug
>Reporter: Imran Rashid
>Priority: Trivial
>
> Currently there isn't a good way for a developer to skip the longer 
> integration tests.  This can slow down local development.  See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html
> One option is to use scalatest's notion of test tags to tag all integration 
> tests, so they could easily be skipped



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4818) Join operation should use iterator/lazy evaluation

2014-12-10 Thread Johannes Simon (JIRA)
Johannes Simon created SPARK-4818:
-

 Summary: Join operation should use iterator/lazy evaluation
 Key: SPARK-4818
 URL: https://issues.apache.org/jira/browse/SPARK-4818
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.1
Reporter: Johannes Simon
Priority: Minor


The current implementation of the join operation does not use an iterator (i.e. 
lazy evaluation), causing it to explicitly evaluate the co-grouped values. In 
big data applications, these value collections can be very large. This causes 
the *cartesian product of all co-grouped values* for a specific key of both 
RDDs to be kept in memory during the flatMapValues operation, resulting in an 
*O(size(pair._1)*size(pair._2))* memory consumption instead of *O(1)*. Very 
large value collections will therefore cause "GC overhead limit exceeded" 
exceptions and fail the task, or at least slow down execution dramatically.

{code:title=PairRDDFunctions.scala|borderStyle=solid}
//...
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = {
  this.cogroup(other, partitioner).flatMapValues( pair =>
for (v <- pair._1; w <- pair._2) yield (v, w)
  )
}
//...
{code}

Since cogroup returns Iterables (backed by Arrays), the join implementation 
could be changed to the following, which uses lazy evaluation instead and has 
almost no additional memory overhead:
{code:title=PairRDDFunctions.scala|borderStyle=solid}
//...
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = {
  this.cogroup(other, partitioner).flatMapValues( pair =>
for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}
//...
{code}

Alternatively, if the current implementation is intentionally not using lazy 
evaluation for some reason, there could be a *lazyJoin()* method next to the 
original join implementation that utilizes lazy evaluation. This of course 
applies to other join operations as well.
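
If changing PairRDDFunctions itself is off the table, a hedged sketch of the 
same idea as a user-side helper; the names and imports are illustrative and 
assume the Spark 1.1-style pair-RDD implicits.

{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object LazyJoinExample {
  // Illustrative user-side lazyJoin built on cogroup: iterating lazily keeps
  // the per-key cartesian product from being materialized in memory.
  implicit class LazyJoinOps[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) {
    def lazyJoin[W](other: RDD[(K, W)]): RDD[(K, (V, W))] =
      self.cogroup(other).flatMapValues { pair =>
        for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
      }
  }
}
{code}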

Thanks! :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped

2014-12-10 Thread Mark Fisher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241362#comment-14241362
 ] 

Mark Fisher commented on SPARK-2892:


[~srowen] SPARK-4802 is only related to the receiverInfo not being removed. 
This issue is actually much more critical, given that Receivers do not seem to 
stop other than in local mode. Please reopen.


> Socket Receiver does not stop when streaming context is stopped
> ---
>
> Key: SPARK-2892
> URL: https://issues.apache.org/jira/browse/SPARK-2892
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.2
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Running NetworkWordCount with
> {quote}  
> ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); 
> Thread.sleep(6)
> {quote}
> gives the following error
> {quote}
> 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) 
> in 10047 ms on localhost (1/1)
> 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at 
> ReceiverTracker.scala:275) finished in 10.056 s
> 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
> have all completed, from pool
> 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at 
> ReceiverTracker.scala:275, took 10.179263 s
> 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been 
> terminated
> 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not 
> deregistered, Map(0 -> 
> ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,))
> 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped
> 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately
> 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after 
> time 1407375433000
> 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator
> 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler
> 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully
> 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving
> 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost:
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4746) integration tests should be separated from faster unit tests

2014-12-10 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-4746:

Summary: integration tests should be separated from faster unit tests  
(was: integration tests should be seseparated from faster unit tests)

> integration tests should be separated from faster unit tests
> 
>
> Key: SPARK-4746
> URL: https://issues.apache.org/jira/browse/SPARK-4746
> Project: Spark
>  Issue Type: Bug
>Reporter: Imran Rashid
>Priority: Trivial
>
> Currently there isn't a good way for a developer to skip the longer 
> integration tests.  This can slow down local development.  See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html
> One option is to use scalatest's notion of test tags to tag all integration 
> tests, so they could easily be skipped



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1338) Create Additional Style Rules

2014-12-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241297#comment-14241297
 ] 

Sean Owen commented on SPARK-1338:
--

The PR for this was abandoned. What's the thinking on these style-rule JIRAs? 
There are a number still open.

> Create Additional Style Rules
> -
>
> Key: SPARK-1338
> URL: https://issues.apache.org/jira/browse/SPARK-1338
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
>Assignee: Prashant Sharma
> Fix For: 1.2.0
>
>
> There are a few other rules that would be helpful to have. Also we should add 
> tests for these rules because it's easy to get them wrong. I gave some 
> example comparisons from a javascript style checker.
> Require spaces in type declarations:
> def foo:String = X // no
> def foo: String = XXX
> def x:Int = 100 // no
> val x: Int = 100
> Require spaces after keywords:
> if(x - 3) // no
> if (x + 10)
> See: requireSpaceAfterKeywords from
> https://github.com/mdevils/node-jscs
> Disallow spaces inside of parentheses:
> val x = ( 3 + 5 ) // no
> val x = (3 + 5)
> See: disallowSpacesInsideParentheses from
> https://github.com/mdevils/node-jscs
> Require spaces before and after binary operators:
> See: requireSpaceBeforeBinaryOperators
> See: disallowSpaceAfterBinaryOperators
> from https://github.com/mdevils/node-jscs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1380) Add sort-merge based cogroup/joins.

2014-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1380.
--
Resolution: Won't Fix

The PR discussion suggests this is WontFix.

> Add sort-merge based cogroup/joins.
> ---
>
> Key: SPARK-1380
> URL: https://issues.apache.org/jira/browse/SPARK-1380
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Takuya Ueshin
>
> I've written cogroup/joins based on 'Sort-Merge' algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1385) Use existing code-path for JSON de/serialization of BlockId

2014-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1385.
--
Resolution: Fixed

PR is https://github.com/apache/spark/pull/289. This was merged in 
https://github.com/apache/spark/commit/de8eefa804e229635eaa29a78b9e9ce161ac58e1

> Use existing code-path for JSON de/serialization of BlockId
> ---
>
> Key: SPARK-1385
> URL: https://issues.apache.org/jira/browse/SPARK-1385
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 0.9.0, 0.9.1
>Reporter: Andrew Or
>Priority: Minor
> Fix For: 1.0.0
>
>
> BlockId.scala already takes care of JSON de/serialization by parsing the 
> string to and from regex. This functionality is currently duplicated in 
> util/JsonProtocol.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1127) Add saveAsHBase to PairRDDFunctions

2014-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1127.
--
   Resolution: Won't Fix
Fix Version/s: (was: 1.2.0)

Given the discussion in both PRs, this looks like a WontFix, and consensus was 
it should proceed in a separate project. Questions about SchemaRDD and 1.2 
sound like a new topic.

> Add saveAsHBase to PairRDDFunctions
> ---
>
> Key: SPARK-1127
> URL: https://issues.apache.org/jira/browse/SPARK-1127
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: haosdent huang
>Assignee: haosdent huang
>
> Support to save data in HBase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


