[jira] [Commented] (SPARK-5685) Show warning when users open text files compressed with non-splittable algorithms like gzip

2015-02-09 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311931#comment-14311931
 ] 

Nicholas Chammas commented on SPARK-5685:
-

[~joshrosen] - What do you think of adding a warning like this?

 Show warning when users open text files compressed with non-splittable 
 algorithms like gzip
 ---

 Key: SPARK-5685
 URL: https://issues.apache.org/jira/browse/SPARK-5685
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Nicholas Chammas
Priority: Minor

 This is a usability or user-friendliness issue.
 It's extremely common for people to load a text file compressed with gzip, 
 process it, and then wonder why only 1 core in their cluster is doing any 
 work.
 Some examples:
 * http://stackoverflow.com/q/28127119/877069
 * http://stackoverflow.com/q/27531816/877069
 I'm not sure how this problem can be generalized, but at the very least it 
 would be helpful if Spark displayed some kind of warning in the common case 
 when someone opens a gzipped file with {{sc.textFile}}.
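 A minimal sketch, not Spark's actual behavior or API, of what such a check could 
 look like around {{sc.textFile}}; the {{textFileWithWarning}} helper and the 
 extension list are assumptions made purely for illustration:
 {code}
 import org.apache.spark.SparkContext
 import org.apache.spark.rdd.RDD

 object GzipWarning {
   // Extensions whose codecs are not splittable; the exact list is an assumption here.
   private val nonSplittableSuffixes = Seq(".gz", ".lzma")

   // Hypothetical wrapper around sc.textFile that warns about non-splittable input.
   def textFileWithWarning(sc: SparkContext, path: String): RDD[String] = {
     if (nonSplittableSuffixes.exists(suffix => path.toLowerCase.endsWith(suffix))) {
       System.err.println(s"WARN: $path uses a non-splittable compression codec; " +
         "it will be read as a single partition, so only one core will do the work.")
     }
     sc.textFile(path)
   }
 }
 {code}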



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5682) Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle

2015-02-09 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated SPARK-5682:

Attachment: encrypted_shuffle.patch.4

encrypted_shuffle.patch.4 shows how to reuse the Hadoop encryption classes to 
enable Spark encrypted shuffle.
How to apply:
patch -p1 < encrypted_shuffle.patch.4

 Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle
 --

 Key: SPARK-5682
 URL: https://issues.apache.org/jira/browse/SPARK-5682
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle
Reporter: liyunzhang_intel
 Attachments: Design Document of Encrypted Spark 
 Shuffle_20150209.docx, encrypted_shuffle.patch.4


 Encrypted shuffle was introduced in Hadoop 2.6 and makes the shuffle data path 
 safer. This feature is also needed in Spark. We reuse the Hadoop encrypted 
 shuffle feature in Spark, and because UGI credential info is required for 
 encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN 
 framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5681) Calling graceful stop() immediately after start() on StreamingContext should not get stuck indefinitely

2015-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5681:
-
Component/s: Streaming

 Calling graceful stop() immediately after start() on StreamingContext should 
 not get stuck indefinitely
 ---

 Key: SPARK-5681
 URL: https://issues.apache.org/jira/browse/SPARK-5681
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Liang-Chi Hsieh

 Sometimes the receiver is registered with the tracker only after ssc.stop() has 
 been called, especially when stop() is called immediately after start(). The 
 receiver then never gets the StopReceiver message from the tracker, so calling 
 stop() in graceful mode gets stuck indefinitely.
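 A minimal repro sketch of the scenario above, assuming a local master and a 
 receiver-based socket stream (the host and port are placeholders); in the bad 
 case the graceful stop() below never returns:
 {code}
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.{Seconds, StreamingContext}

 object StopRightAfterStart {
   def main(args: Array[String]): Unit = {
     val conf = new SparkConf().setMaster("local[2]").setAppName("StopRightAfterStart")
     val ssc = new StreamingContext(conf, Seconds(1))
     // Receiver-based input: the receiver may register with the tracker only
     // after stop() has already been called.
     ssc.socketTextStream("localhost", 9999).print()
     ssc.start()
     // Graceful stop immediately after start: this is the call that can hang.
     ssc.stop(stopSparkContext = true, stopGracefully = true)
   }
 }
 {code}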



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-09 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311956#comment-14311956
 ] 

Manoj Kumar commented on SPARK-5016:


[~tgaloppo] I would like your inputs on this as well.

 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.
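 A sketch of the general idea, not the MLlib implementation, using Breeze for the 
 linear algebra; the helper name and the shape of the input are assumptions:
 {code}
 import breeze.linalg.{inv, DenseMatrix}
 import org.apache.spark.SparkContext

 object DistributedGaussianInit {
   // Invert each Gaussian's covariance matrix in parallel across the cluster
   // instead of serially on the driver; each inverse is O(numFeatures^3).
   def distributeInverses(sc: SparkContext,
                          covariances: Array[DenseMatrix[Double]]): Array[DenseMatrix[Double]] = {
     sc.parallelize(covariances.zipWithIndex, covariances.length)
       .map { case (sigma, i) => (i, inv(sigma)) }
       .collect()
       .sortBy(_._1)
       .map(_._2)
   }
 }
 {code}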



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5684) Key not found exception is thrown in case location of added partition to a parquet table is different than a path containing the partition values

2015-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311943#comment-14311943
 ] 

Apache Spark commented on SPARK-5684:
-

User 'saucam' has created a pull request for this issue:
https://github.com/apache/spark/pull/4469

 Key not found exception is thrown in case location of added partition to a 
 parquet table is different than a path containing the partition values
 -

 Key: SPARK-5684
 URL: https://issues.apache.org/jira/browse/SPARK-5684
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.1.1, 1.2.0
Reporter: Yash Datta
 Fix For: 1.3.0


 Create a partitioned parquet table : 
 create table test_table (dummy string) partitioned by (timestamp bigint) 
 stored as parquet;
 Add a partition to the table and specify a different location:
 alter table test_table add partition (timestamp=9) location 
 '/data/pth/different'
 Run a simple select * query and we get an exception:
 15/02/09 08:27:25 ERROR thriftserver.SparkSQLDriver: Failed in [select * from 
 db4_mi2mi_binsrc1_default limit 5]
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 
 (TID 21, localhost): java
 .util.NoSuchElementException: key not found: timestamp
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.MapLike$class.apply(MapLike.scala:141)
 at scala.collection.AbstractMap.apply(Map.scala:58)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:141)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:128)
 at 
 org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:247)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 This happens because the parquet path is assumed to contain (key=value) patterns 
 for the partition values, which is not always the case!
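 A small sketch of the failing assumption, not the Spark source itself: partition 
 values are recovered by splitting the location path into key=value segments, so a 
 custom location yields an empty map and the later lookup of timestamp throws 
 NoSuchElementException:
 {code}
 // Hypothetical illustration of the key=value assumption described above.
 def partitionValuesFromPath(path: String): Map[String, String] =
   path.split("/")
     .filter(_.contains("="))
     .map { segment =>
       val Array(k, v) = segment.split("=", 2)
       k -> v
     }
     .toMap

 partitionValuesFromPath("/warehouse/test_table/timestamp=9")  // Map(timestamp -> 9)
 partitionValuesFromPath("/data/pth/different")                // Map() -- "timestamp" lookup fails
 {code}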



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError

2015-02-09 Thread irene rognoni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311948#comment-14311948
 ] 

irene rognoni commented on SPARK-5281:
--

Same issue here since last week. Any news on this?

 Registering table on RDD is giving MissingRequirementError
 --

 Key: SPARK-5281
 URL: https://issues.apache.org/jira/browse/SPARK-5281
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: sarsol
Priority: Critical

 Application crashes on the line rdd.registerTempTable("temp") in version 1.2 
 when using sbt or the Eclipse Scala IDE.
 Stacktrace:
 Exception in thread "main" scala.reflect.internal.MissingRequirementError: 
 class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with 
 primordial classloader with boot classpath 
 [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program
  Files\Java\jre7\lib\resources.jar;C:\Program 
 Files\Java\jre7\lib\rt.jar;C:\Program 
 Files\Java\jre7\lib\sunrsasign.jar;C:\Program 
 Files\Java\jre7\lib\jsse.jar;C:\Program 
 Files\Java\jre7\lib\jce.jar;C:\Program 
 Files\Java\jre7\lib\charsets.jar;C:\Program 
 Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
   at 
 scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
   at 
 scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
   at 
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
   at 
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
   at 
 scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
   at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
   at scala.reflect.api.Universe.typeOf(Universe.scala:59)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
   at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
   at 
 com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
   at 
 scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
   at scala.App$$anonfun$main$1.apply(App.scala:71)
   at scala.App$$anonfun$main$1.apply(App.scala:71)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
   at scala.App$class.main(App.scala:71)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo

2015-02-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311990#comment-14311990
 ] 

Sean Owen commented on SPARK-5676:
--

Dumb question, I know, but what is the spark-ec2 repo you guys refer to? This 
one? https://github.com/mesos/spark-ec2

The only repo for Spark is the main spark repo and it has all this license info 
buttoned up. Other repos are not part of Spark.

 License missing from spark-ec2 repo
 ---

 Key: SPARK-5676
 URL: https://issues.apache.org/jira/browse/SPARK-5676
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Florian Verhein

 There is no LICENSE file or licence headers in the code in the spark-ec2 
 repo. Also, I believe there is no contributor license agreement notification 
 in place (like there is in the main spark repo).
 It would be great to fix this (sooner better than later while contributors 
 list is small), so that users wishing to use this part of Spark are not in 
 doubt over licensing issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5679) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method

2015-02-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311994#comment-14311994
 ] 

Sean Owen commented on SPARK-5679:
--

Same as SPARK-5227?

 Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads 
 and input metrics with mixed read method 
 --

 Key: SPARK-5679
 URL: https://issues.apache.org/jira/browse/SPARK-5679
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Kostas Sakellis
Priority: Blocker

 Please audit these and see if there are any assumptions with respect to File 
 IO that might not hold in all cases. I'm happy to help if you can't find 
 anything.
 These both failed in the same run:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-SBT/38/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink
 {code}
 org.apache.spark.metrics.InputOutputMetricsSuite.input metrics with mixed 
 read method
 Failing for the past 13 builds (Since Failed#26 )
 Took 48 sec.
 Error Message
 2030 did not equal 6496
 Stacktrace
 sbt.ForkMain$ForkError: 2030 did not equal 6496
   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
   at 
 org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply$mcV$sp(InputOutputMetricsSuite.scala:135)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfter$$super$runTest(InputOutputMetricsSuite.scala:46)
   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite.runTest(InputOutputMetricsSuite.scala:46)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfterAll$$super$run(InputOutputMetricsSuite.scala:46)
   at 
 org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
   at 
 org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
   at 
 

[jira] [Resolved] (SPARK-5473) Expose SSH failures after status checks pass

2015-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5473.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4262
[https://github.com/apache/spark/pull/4262]

 Expose SSH failures after status checks pass
 

 Key: SPARK-5473
 URL: https://issues.apache.org/jira/browse/SPARK-5473
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.2.0
Reporter: Nicholas Chammas
Priority: Minor
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5473) Expose SSH failures after status checks pass

2015-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5473:
-
Assignee: Nicholas Chammas

 Expose SSH failures after status checks pass
 

 Key: SPARK-5473
 URL: https://issues.apache.org/jira/browse/SPARK-5473
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.2.0
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
Priority: Minor
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5239) JdbcRDD throws java.lang.AbstractMethodError: oracle.jdbc.driver.xxxxxx.isClosed()Z

2015-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312040#comment-14312040
 ] 

Apache Spark commented on SPARK-5239:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4470

 JdbcRDD throws java.lang.AbstractMethodError: 
 oracle.jdbc.driver.xx.isClosed()Z
 -

 Key: SPARK-5239
 URL: https://issues.apache.org/jira/browse/SPARK-5239
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.1, 1.2.0
 Environment: centos6.4 + ojdbc14
Reporter: Gankun Luo
Priority: Minor

 I tried to use JdbcRDD to operate on a table in an Oracle database, but it 
 failed. My test code is as follows:
 {code}
 import java.sql.DriverManager
 import org.apache.spark.SparkContext
 import org.apache.spark.rdd.JdbcRDD
 import org.apache.spark.SparkConf

 object JdbcRDD4Oracle {
   def main(args: Array[String]) {
     val sc = new SparkContext(
       new SparkConf().setAppName("JdbcRDD4Oracle").setMaster("local[2]"))
     val rdd = new JdbcRDD(sc,
       () => getConnection, getSQL, 12987, 13055, 3,
       r => {
         (r.getObject("HISTORY_ID"), r.getObject("APPROVE_OPINION"))
       })
     println(rdd.collect.toList)

     sc.stop()
   }

   def getConnection() = {
     Class.forName("oracle.jdbc.driver.OracleDriver").newInstance()
     DriverManager.getConnection("jdbc:oracle:thin:@hadoop000:1521/ORCL",
       "scott", "tiger")
   }

   def getSQL() = {
     "select HISTORY_ID,APPROVE_OPINION from CI_APPROVE_HISTORY WHERE " +
       "HISTORY_ID >= ? AND HISTORY_ID <= ?"
   }
 }
 {code}
 Run the example, I get the following exception:
 {code}
 09:56:48,302 [Executor task launch worker-0] ERROR Logging$class : Error in 
 TaskCompletionListener
 java.lang.AbstractMethodError: 
 oracle.jdbc.driver.OracleResultSetImpl.isClosed()Z
   at org.apache.spark.rdd.JdbcRDD$$anon$1.close(JdbcRDD.scala:99)
   at 
 org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
   at 
 org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala:71)
   at 
 org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala:71)
   at 
 org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:85)
   at 
 org.apache.spark.TaskContext$$anonfun$markTaskCompleted$1.apply(TaskContext.scala:110)
   at 
 org.apache.spark.TaskContext$$anonfun$markTaskCompleted$1.apply(TaskContext.scala:108)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at org.apache.spark.TaskContext.markTaskCompleted(TaskContext.scala:108)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:64)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 09:56:48,302 [Executor task launch worker-1] ERROR Logging$class : Error in 
 TaskCompletionListener
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5688) In Decision Trees, choosing a random subset of categories for each split

2015-02-09 Thread Eric Denovitzer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Denovitzer updated SPARK-5688:
---
Labels: categorical decisiontree  (was: categorical)

 In Decision Trees, choosing a random subset of categories for each split
 

 Key: SPARK-5688
 URL: https://issues.apache.org/jira/browse/SPARK-5688
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
 Environment: Any
Reporter: Eric Denovitzer
  Labels: categorical, decisiontree
 Fix For: 1.2.0


 The subset of categories chosen to build a split on a categorical variable is 
 not random. The categories for each subset are chosen based on the binary 
 representation of a number from 1 to (2^(number of categories)) - 2 (excluding 
 the empty and full subsets). In the current implementation, the integers used 
 for the subsets are 1..numSplits. They should be chosen at random instead of 
 biasing towards the categories with the lower indexes.
 Another problem is that if numBins/2 is bigger than the number of possible 
 subsets for a given category set, numSplits is still taken to be numBins/2. It 
 should be the min of numBins/2 and (2^(number of categories)) - 2, otherwise 
 the same subsets might be considered more than once when choosing the splits.
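 A small sketch of the proposed behavior, not the MLlib code: draw numSplits 
 distinct subset indices uniformly from 1..(2^numCategories - 2) and cap numSplits 
 at that total, so no subset is biased towards low indexes or considered twice 
 (assumes numCategories is small enough for the enumeration to fit in an Int):
 {code}
 import scala.util.Random

 def sampleSplitIndices(numCategories: Int, requestedSplits: Int, rng: Random): Array[Int] = {
   val totalSubsets = (1 << numCategories) - 2          // excludes the empty and full subsets
   val numSplits = math.min(requestedSplits, totalSubsets)
   rng.shuffle((1 to totalSubsets).toList).take(numSplits).toArray
 }
 {code}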



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4423) Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior

2015-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4423:
-
Assignee: Ilya Ganelin

 Improve foreach() documentation to avoid confusion between local- and 
 cluster-mode behavior
 ---

 Key: SPARK-4423
 URL: https://issues.apache.org/jira/browse/SPARK-4423
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Josh Rosen
Assignee: Ilya Ganelin

 {{foreach}} seems to be a common source of confusion for new users: in 
 {{local}} mode, {{foreach}} can be used to update local variables on the 
 driver, but programs that do this will not work properly when executed on 
 clusters, since the {{foreach}} will update per-executor variables (note that 
 this _will_ work correctly for accumulators, but not for other types of 
 mutable objects).
 Similarly, I've seen users become confused when {{.foreach(println)}} doesn't 
 print to the driver's standard output.
 At a minimum, we should improve the documentation to warn users against 
 unsafe uses of {{foreach}} that won't work properly when transitioning from 
 local mode to a real cluster.
 We might also consider changes to local mode so that its behavior more 
 closely matches the cluster modes; this will require some discussion, though, 
 since any change of behavior here would technically be a user-visible 
 backwards-incompatible change (I don't think that we made any explicit 
 guarantees about the current local-mode behavior, but someone might be 
 relying on the current implicit behavior).
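 A minimal sketch of the pitfall using the 1.x accumulator API, assuming local 
 mode for the example; the point is the contrast between the two patterns, not a 
 documentation proposal:
 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 object ForeachPitfall {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(
       new SparkConf().setAppName("ForeachPitfall").setMaster("local[2]"))
     val data = sc.parallelize(1 to 100)

     // Unsafe: on a cluster each executor mutates its own copy of `sum`,
     // so the driver-side value stays 0 even though this appears to work locally.
     var sum = 0
     data.foreach(x => sum += x)

     // Safe: accumulators are the supported way to aggregate side effects.
     val acc = sc.accumulator(0)
     data.foreach(x => acc += x)
     println(s"plain var = $sum, accumulator = ${acc.value}")

     sc.stop()
   }
 }
 {code}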



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4655) Split Stage into ShuffleMapStage and ResultStage subclasses

2015-02-09 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312458#comment-14312458
 ] 

Ilya Ganelin commented on SPARK-4655:
-

Hi [~joshrosen], I'd be happy to work on this. Thanks!

 Split Stage into ShuffleMapStage and ResultStage subclasses
 ---

 Key: SPARK-4655
 URL: https://issues.apache.org/jira/browse/SPARK-4655
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen

 The scheduler's {{Stage}} class has many fields which are only applicable to 
 result stages or shuffle map stages.  As a result, I think that it makes 
 sense to make {{Stage}} into an abstract base class with two subclasses, 
 {{ResultStage}} and {{ShuffleMapStage}}.  This would improve the 
 understandability of the DAGScheduler code. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4423) Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior

2015-02-09 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312424#comment-14312424
 ] 

Ilya Ganelin commented on SPARK-4423:
-

I'll be happy to update this. Thank you.

 Improve foreach() documentation to avoid confusion between local- and 
 cluster-mode behavior
 ---

 Key: SPARK-4423
 URL: https://issues.apache.org/jira/browse/SPARK-4423
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Josh Rosen

 {{foreach}} seems to be a common source of confusion for new users: in 
 {{local}} mode, {{foreach}} can be used to update local variables on the 
 driver, but programs that do this will not work properly when executed on 
 clusters, since the {{foreach}} will update per-executor variables (note that 
 this _will_ work correctly for accumulators, but not for other types of 
 mutable objects).
 Similarly, I've seen users become confused when {{.foreach(println)}} doesn't 
 print to the driver's standard output.
 At a minimum, we should improve the documentation to warn users against 
 unsafe uses of {{foreach}} that won't work properly when transitioning from 
 local mode to a real cluster.
 We might also consider changes to local mode so that its behavior more 
 closely matches the cluster modes; this will require some discussion, though, 
 since any change of behavior here would technically be a user-visible 
 backwards-incompatible change (I don't think that we made any explicit 
 guarantees about the current local-mode behavior, but someone might be 
 relying on the current implicit behavior).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5651) Support 'create db.table' in HiveContext

2015-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312344#comment-14312344
 ] 

Apache Spark commented on SPARK-5651:
-

User 'OopsOutOfMemory' has created a pull request for this issue:
https://github.com/apache/spark/pull/4473

 Support 'create db.table' in HiveContext
 

 Key: SPARK-5651
 URL: https://issues.apache.org/jira/browse/SPARK-5651
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yadong Qi

 The current Spark version only supports ```create table 
 table_in_database_creation.test1 as select * from src limit 1;``` in 
 HiveContext.
 This patch adds support for ```create table 
 `table_in_database_creation.test2` as select * from src limit 1;``` in 
 HiveContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5570) No docs stating that `new SparkConf().set("spark.driver.memory", ...)` will not work

2015-02-09 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312416#comment-14312416
 ] 

Ilya Ganelin commented on SPARK-5570:
-

I'll fix this, can you please assign it to me? Thanks.

 No docs stating that `new SparkConf().set("spark.driver.memory", ...)` will 
 not work
 ---

 Key: SPARK-5570
 URL: https://issues.apache.org/jira/browse/SPARK-5570
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Spark Core
Affects Versions: 1.2.0
Reporter: Tathagata Das
Assignee: Andrew Or
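 A sketch of the pitfall the title refers to, with an assumed 4g value for 
 illustration:
 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 object DriverMemoryPitfall {
   def main(args: Array[String]): Unit = {
     // Read too late: by the time this code runs, the driver JVM has already
     // been launched with its heap size fixed, so the setting has no effect here.
     val conf = new SparkConf()
       .setAppName("DriverMemoryPitfall")
       .set("spark.driver.memory", "4g")
     val sc = new SparkContext(conf)
     // Working alternatives live outside the program: pass --driver-memory 4g
     // to spark-submit, or set spark.driver.memory in conf/spark-defaults.conf.
     sc.stop()
   }
 }
 {code}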





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends

2015-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-823:

Assignee: Ilya Ganelin

 spark.default.parallelism's default is inconsistent across scheduler backends
 -

 Key: SPARK-823
 URL: https://issues.apache.org/jira/browse/SPARK-823
 Project: Spark
  Issue Type: Bug
  Components: Documentation, PySpark, Scheduler
Affects Versions: 0.8.0, 0.7.3, 0.9.1
Reporter: Josh Rosen
Assignee: Ilya Ganelin
Priority: Minor

 The [0.7.3 configuration 
 guide|http://spark-project.org/docs/latest/configuration.html] says that 
 {{spark.default.parallelism}}'s default is 8, but the default is actually 
 max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos 
 scheduler, and {{threads}} for the local scheduler:
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150
 Should this be clarified in the documentation?  Should the Mesos scheduler 
 backend's default be revised?
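 A short sketch of how a job can sidestep the inconsistent per-backend defaults by 
 setting the value explicitly (the 16 and local[4] values are arbitrary choices 
 for this example):
 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 object ExplicitParallelism {
   def main(args: Array[String]): Unit = {
     val conf = new SparkConf()
       .setAppName("ExplicitParallelism")
       .setMaster("local[4]")
       .set("spark.default.parallelism", "16")   // explicit, backend-independent
     val sc = new SparkContext(conf)
     // Shuffle operations such as reduceByKey now default to 16 partitions
     // when no numPartitions argument is given.
     val counts = sc.parallelize(Seq("a", "b", "a")).map((_, 1)).reduceByKey(_ + _)
     println(counts.partitions.length)   // 16
     sc.stop()
   }
 }
 {code}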



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled

2015-02-09 Thread Twinkle Sachdeva (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312328#comment-14312328
 ] 

Twinkle Sachdeva commented on SPARK-4705:
-

Hi [~vanzin]

Please take a look at the screenshot. I will make "NA" a non-anchored element.

It shows the history server UI, where some of the applications were run on a 
scheduler that does not support multiple attempts, whereas others have multiple 
attempts.

Should we introduce a property that shows the multi-attempt UI by default?

 Driver retries in yarn-cluster mode always fail if event logging is enabled
 ---

 Key: SPARK-4705
 URL: https://issues.apache.org/jira/browse/SPARK-4705
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Marcelo Vanzin
 Attachments: multi-attempts with no attempt based UI.png


 yarn-cluster mode will retry to run the driver in certain failure modes. If 
 event logging is enabled, this will most probably fail, because:
 {noformat}
 Exception in thread Driver java.io.IOException: Log directory 
 hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
  already exists!
 at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
 at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
 at 
 org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
 at org.apache.spark.SparkContext.init(SparkContext.scala:353)
 {noformat}
 The event log path should be more unique. Or perhaps retries of the same app 
 should clean up the old logs first.
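 A sketch of the first idea, not the eventual fix: make the log directory unique 
 per driver attempt so a retried driver does not collide with the directory left 
 by the failed attempt (names and layout are assumptions for illustration):
 {code}
 // Hypothetical helper, not Spark code: derive a per-attempt event log directory.
 def eventLogDir(baseDir: String, appId: String, attemptId: Int): String =
   if (attemptId <= 1) s"$baseDir/$appId"
   else s"$baseDir/${appId}_attempt$attemptId"

 eventLogDir("hdfs:///user/spark/applicationHistory",
   "application_1417554558066_0003", 2)
 // => hdfs:///user/spark/applicationHistory/application_1417554558066_0003_attempt2
 {code}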



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5687) in TaskResultGetter need to catch OutOfMemoryError.

2015-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312365#comment-14312365
 ] 

Apache Spark commented on SPARK-5687:
-

User 'lianhuiwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/4474

 in TaskResultGetter need to catch OutOfMemoryError.
 ---

 Key: SPARK-5687
 URL: https://issues.apache.org/jira/browse/SPARK-5687
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Lianhui Wang

 In enqueueSuccessfulTask another thread fetches the task result; if the result 
 is very large, it may throw an OutOfMemoryError. If we do not catch the 
 OutOfMemoryError, the DAGScheduler does not know the status of this task.
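 A sketch of the idea, not the actual TaskResultGetter change: the result-fetching 
 thread catches OutOfMemoryError around deserialization and reports the task as 
 failed so the scheduler learns its status; reportFailure stands in for whatever 
 hook the scheduler exposes:
 {code}
 import java.nio.ByteBuffer

 def deserializeResultSafely[T](data: ByteBuffer,
                                deserialize: ByteBuffer => T,
                                reportFailure: Throwable => Unit): Option[T] =
   try {
     Some(deserialize(data))
   } catch {
     case oom: OutOfMemoryError =>
       // Without this catch the thread dies and the task's status is never reported.
       reportFailure(oom)
       None
   }
 {code}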



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5688) Splits for Categorical Variables in DecisionTrees

2015-02-09 Thread Eric Denovitzer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Denovitzer updated SPARK-5688:
---
Summary: Splits for Categorical Variables in DecisionTrees  (was: In 
Decision Trees, choosing a random subset of categories for each split)

 Splits for Categorical Variables in DecisionTrees
 -

 Key: SPARK-5688
 URL: https://issues.apache.org/jira/browse/SPARK-5688
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
 Environment: Any
Reporter: Eric Denovitzer
  Labels: categorical, decisiontree
 Fix For: 1.2.0


 The subset of categories chosen to build a split on a categorical variable is 
 not random. The categories for each subset are chosen based on the binary 
 representation of a number from 1 to (2^(number of categories)) - 2 (excluding 
 the empty and full subsets). In the current implementation, the integers used 
 for the subsets are 1..numSplits. They should be chosen at random instead of 
 biasing towards the categories with the lower indexes.
 Another problem is that if numBins/2 is bigger than the number of possible 
 subsets for a given category set, numSplits is still taken to be numBins/2. It 
 should be the min of numBins/2 and (2^(number of categories)) - 2, otherwise 
 the same subsets might be considered more than once when choosing the splits.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5688) In Decision Trees, choosing a random subset of categories for each split

2015-02-09 Thread Eric Denovitzer (JIRA)
Eric Denovitzer created SPARK-5688:
--

 Summary: In Decision Trees, choosing a random subset of categories 
for each split
 Key: SPARK-5688
 URL: https://issues.apache.org/jira/browse/SPARK-5688
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
 Environment: Any
Reporter: Eric Denovitzer
 Fix For: 1.2.0


The subset of categories chosen to build a split on a categorical variable is 
not random. The categories for each subset are chosen based on the binary 
representation of a number from 1 to (2^(number of categories)) - 2 (excluding 
the empty and full subsets). In the current implementation, the integers used 
for the subsets are 1..numSplits. They should be chosen at random instead of 
biasing towards the categories with the lower indexes.
Another problem is that if numBins/2 is bigger than the number of possible 
subsets for a given category set, numSplits is still taken to be numBins/2. It 
should be the min of numBins/2 and (2^(number of categories)) - 2, otherwise the 
same subsets might be considered more than once when choosing the splits.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5688) Splits for Categorical Variables in DecisionTrees

2015-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312403#comment-14312403
 ] 

Apache Spark commented on SPARK-5688:
-

User 'edenovit' has created a pull request for this issue:
https://github.com/apache/spark/pull/4475

 Splits for Categorical Variables in DecisionTrees
 -

 Key: SPARK-5688
 URL: https://issues.apache.org/jira/browse/SPARK-5688
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
 Environment: Any
Reporter: Eric Denovitzer
  Labels: categorical, decisiontree
 Fix For: 1.2.0


 The subset of categories chosen to build a split on a categorical variable is 
 not random. The categories for each subset are chosen based on the binary 
 representation of a number from 1 to (2^(number of categories)) - 2 (excluding 
 the empty and full subsets). In the current implementation, the integers used 
 for the subsets are 1..numSplits. They should be chosen at random instead of 
 biasing towards the categories with the lower indexes.
 Another problem is that if numBins/2 is bigger than the number of possible 
 subsets for a given category set, numSplits is still taken to be numBins/2. It 
 should be the min of numBins/2 and (2^(number of categories)) - 2, otherwise 
 the same subsets might be considered more than once when choosing the splits.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5079) Detect failed jobs / batches in Spark Streaming unit tests

2015-02-09 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312415#comment-14312415
 ] 

Ilya Ganelin commented on SPARK-5079:
-

I can work on this - can you please assign it to me? Thank you. 

 Detect failed jobs / batches in Spark Streaming unit tests
 --

 Key: SPARK-5079
 URL: https://issues.apache.org/jira/browse/SPARK-5079
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Josh Rosen

 Currently, it is possible to write Spark Streaming unit tests where Spark 
 jobs fail but the streaming tests succeed because we rely on wall-clock time 
 plus output comparison in order to check whether a test has passed, and 
 hence may miss cases where errors occurred if they didn't affect these 
 results.  We should strengthen the tests to check that no job failures 
 occurred while processing batches.
 See https://github.com/apache/spark/pull/3832#issuecomment-68580794 for 
 additional context.
 The StreamingTestWaiter in https://github.com/apache/spark/pull/3801 might 
 also fix this.
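 A sketch of one way such a check could look, not the StreamingTestWaiter from the 
 linked PR: register a SparkListener that counts failed jobs, and have the test 
 assert the count is zero after the batches finish:
 {code}
 import java.util.concurrent.atomic.AtomicInteger
 import org.apache.spark.SparkContext
 import org.apache.spark.scheduler.{JobSucceeded, SparkListener, SparkListenerJobEnd}

 class JobFailureCounter extends SparkListener {
   val failures = new AtomicInteger(0)
   override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
     if (jobEnd.jobResult != JobSucceeded) failures.incrementAndGet()
 }

 // In a test: install before running batches, then assert failures.get() == 0.
 def installCounter(sc: SparkContext): JobFailureCounter = {
   val counter = new JobFailureCounter
   sc.addSparkListener(counter)
   counter
 }
 {code}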



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5689) Document what can be run in different YARN modes

2015-02-09 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-5689:


 Summary: Document what can be run in different YARN modes
 Key: SPARK-5689
 URL: https://issues.apache.org/jira/browse/SPARK-5689
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves


We should document what can be run in the different YARN modes. For instance, 
the interactive shell only works in yarn-client mode; recently, with 
https://github.com/apache/spark/pull/3976, users can run Python scripts in 
cluster mode, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends

2015-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-823.
-
Resolution: Fixed

 spark.default.parallelism's default is inconsistent across scheduler backends
 -

 Key: SPARK-823
 URL: https://issues.apache.org/jira/browse/SPARK-823
 Project: Spark
  Issue Type: Bug
  Components: Documentation, PySpark, Scheduler
Affects Versions: 0.8.0, 0.7.3, 0.9.1
Reporter: Josh Rosen
Assignee: Ilya Ganelin
Priority: Minor

 The [0.7.3 configuration 
 guide|http://spark-project.org/docs/latest/configuration.html] says that 
 {{spark.default.parallelism}}'s default is 8, but the default is actually 
 max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos 
 scheduler, and {{threads}} for the local scheduler:
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150
 Should this be clarified in the documentation?  Should the Mesos scheduler 
 backend's default be revised?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled

2015-02-09 Thread Twinkle Sachdeva (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Twinkle Sachdeva updated SPARK-4705:

Attachment: multi-attempts with no attempt based UI.png

 Driver retries in yarn-cluster mode always fail if event logging is enabled
 ---

 Key: SPARK-4705
 URL: https://issues.apache.org/jira/browse/SPARK-4705
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Marcelo Vanzin
 Attachments: multi-attempts with no attempt based UI.png


 yarn-cluster mode will retry to run the driver in certain failure modes. If 
 event logging is enabled, this will most probably fail, because:
 {noformat}
 Exception in thread Driver java.io.IOException: Log directory 
 hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
  already exists!
 at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
 at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
 at 
 org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
 at org.apache.spark.SparkContext.init(SparkContext.scala:353)
 {noformat}
 The event log path should be more unique. Or perhaps retries of the same app 
 should clean up the old logs first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5703) JobProgressListener throws empty.max error

2015-02-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5703:
-
Description: 
In JobProgressListener, if you have a JobEnd that does not have a corresponding 
JobStart, AND you render the AllJobsPage, then you'll run into the empty.max 
exception.

I ran into this when trying to replay parts of an event log after trimming a 
few events. Not a common use case I agree, but I'd argue that it should never 
fail on empty.max.

  was:
In JobProgressListener, if you have a JobEnd that does not have a corresponding 
JobStart, AND you render the AllJobsPage, then you'll run into the empty.max 
exception.

I ran into this when trying to replay parts of an event log after trimming a 
few events. Not a common use case I agree, but it should not fail with 
empty.max.


 JobProgressListener throws empty.max error
 --

 Key: SPARK-5703
 URL: https://issues.apache.org/jira/browse/SPARK-5703
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical

 In JobProgressListener, if you have a JobEnd that does not have a 
 corresponding JobStart, AND you render the AllJobsPage, then you'll run into 
 the empty.max exception.
 I ran into this when trying to replay parts of an event log after trimming a 
 few events. Not a common use case I agree, but I'd argue that it should never 
 fail on empty.max.
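 A tiny sketch of the kind of guard that avoids this failure mode, not the actual 
 JobProgressListener code:
 {code}
 // A JobEnd with no matching JobStart leaves the timestamp collection empty.
 val completionTimes: Seq[Long] = Seq.empty

 // completionTimes.max would throw java.lang.UnsupportedOperationException: empty.max
 val latest = completionTimes.reduceOption(_ max _).getOrElse(-1L)   // safe fallback
 {code}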



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5703) JobProgressListener throws empty.max error

2015-02-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5703:
-
Summary: JobProgressListener throws empty.max error  (was: 
JobProgressListener throws empty.max error in HS)

 JobProgressListener throws empty.max error
 --

 Key: SPARK-5703
 URL: https://issues.apache.org/jira/browse/SPARK-5703
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical

 In JobProgressListener, if you have a JobEnd that does not have a 
 corresponding JobStart, AND you render the AllJobsPage, then you'll run into 
 the empty.max exception.
 I ran into this when trying to replay parts of an event log after trimming a 
 few events. Not a common use case I agree, but it should not fail with 
 empty.max.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5703) JobProgressListener throws empty.max error in HS

2015-02-09 Thread Andrew Or (JIRA)
Andrew Or created SPARK-5703:


 Summary: JobProgressListener throws empty.max error in HS
 Key: SPARK-5703
 URL: https://issues.apache.org/jira/browse/SPARK-5703
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical


In JobProgressListener, if you have a JobEnd that does not have a corresponding 
JobStart, AND you render the AllJobsPage, then you'll run into the empty.max 
exception.

I ran into this when trying to replay parts of an event log after trimming a 
few events. Not a common use case I agree, but it should not fail with 
empty.max.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-02-09 Thread Mark Khaitman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313652#comment-14313652
 ] 

Mark Khaitman commented on SPARK-4105:
--

We're running 1.2.1-rc2 on our cluster and running into the exact same problem. 
Several different jobs by different users will typically run perfectly fine, 
and then another identical run randomly throws the FAILED_TO_UNCOMPRESS(5) 
error, which causes the job to fail altogether. I'll try to reproduce this 
somehow, though it is a tricky one!

 FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
 shuffle
 -

 Key: SPARK-4105
 URL: https://issues.apache.org/jira/browse/SPARK-4105
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker

 We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
 shuffle read.  Here's a sample stacktrace from an executor:
 {code}
 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
 33053)
 java.io.IOException: FAILED_TO_UNCOMPRESS(5)
   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
   at 
 org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
   at 
 org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
   at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58)
   at 
 org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
   at 
 org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
   at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 

[jira] [Created] (SPARK-5707) Enabling spark.sql.codegen throws ClassNotFound exception

2015-02-09 Thread Yi Yao (JIRA)
Yi Yao created SPARK-5707:
-

 Summary: Enabling spark.sql.codegen throws ClassNotFound exception
 Key: SPARK-5707
 URL: https://issues.apache.org/jira/browse/SPARK-5707
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
 Environment: yarn-client mode
Reporter: Yi Yao


org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in 
stage 133.0 failed 4 times, most recent failure: Lost task 13.3 in stage 133.0 
(TID 3066, cdh52-node2): java.io.IOException: 
com.esotericsoftware.kryo.KryoException: Unable to find class: 
__wrapper$1$81257352e1c844aebf09cb84fe9e7459.__wrapper$1$81257352e1c844aebf09cb84fe9e7459$SpecificRow$1
Serialization trace:
hashTable (org.apache.spark.sql.execution.joins.UniqueKeyHashedRelation)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:62)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:61)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



SQL:
INSERT INTO TABLE ${hiveconf:TEMP_TABLE}
SELECT
  s_store_name,
  pr_review_date,
  pr_review_content
FROM (
  --select store_name for stores with flat or declining sales in 3 consecutive months.
  SELECT s_store_name
  FROM store s
  JOIN (
    -- linear regression part
    SELECT
      temp.cat AS cat,
      --SUM(temp.x) AS sumX,
      --SUM(temp.y) AS sumY,
      --SUM(temp.xy) AS 

[jira] [Created] (SPARK-5708) Add Slf4jSink to Spark Metrics Sink

2015-02-09 Thread Judy Nash (JIRA)
Judy Nash created SPARK-5708:


 Summary: Add Slf4jSink to Spark Metrics Sink
 Key: SPARK-5708
 URL: https://issues.apache.org/jira/browse/SPARK-5708
 Project: Spark
  Issue Type: Bug
Reporter: Judy Nash


Add Slf4jSink to the currently supported metric sinks.

This is convenient for users who want metrics data for telemetry purposes but 
would like to reuse the log4j pipeline they have already set up. 
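
For reference, a minimal sketch (not Spark's sink code) of the mechanism such a 
sink would wrap: Dropwizard's Slf4jReporter pushing a MetricRegistry through 
SLF4J, which the existing log4j configuration then picks up. The object name, 
logger name and poll period below are illustrative assumptions.

{code}
import java.util.concurrent.TimeUnit
import com.codahale.metrics.{MetricRegistry, Slf4jReporter}
import org.slf4j.LoggerFactory

object Slf4jMetricsSketch {
  def main(args: Array[String]): Unit = {
    val registry = new MetricRegistry()
    registry.counter("example.counter").inc()

    // Route the registry's metrics to an SLF4J logger; log4j formats and ships them.
    val reporter = Slf4jReporter.forRegistry(registry)
      .outputTo(LoggerFactory.getLogger("metrics"))
      .convertRatesTo(TimeUnit.SECONDS)
      .convertDurationsTo(TimeUnit.MILLISECONDS)
      .build()

    reporter.start(10, TimeUnit.SECONDS)  // illustrative poll period
    reporter.report()                     // force one immediate report
  }
}
{code}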



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2653) Heap size should be the sum of driver.memory and executor.memory in local mode

2015-02-09 Thread liu chang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313664#comment-14313664
 ] 

liu chang commented on SPARK-2653:
--

Hi, Davies, what's wrong with this?

 Heap size should be the sum of driver.memory and executor.memory in local mode
 --

 Key: SPARK-2653
 URL: https://issues.apache.org/jira/browse/SPARK-2653
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Davies Liu
Priority: Minor
   Original Estimate: 1h
  Remaining Estimate: 1h

 In local mode, the driver and executor run in the same JVM, so the heap size 
 of JVM should be the sum of spark.driver.memory and spark.executor.memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5703) AllJobsPage throws empty.max error

2015-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313332#comment-14313332
 ] 

Apache Spark commented on SPARK-5703:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/4490

 AllJobsPage throws empty.max error
 --

 Key: SPARK-5703
 URL: https://issues.apache.org/jira/browse/SPARK-5703
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical

 In JobProgressListener, if you have a JobEnd that does not have a 
 corresponding JobStart, AND you render the AllJobsPage, then you'll run into 
 the empty.max exception.
 I ran into this when trying to replay parts of an event log after trimming a 
 few events. Not a common use case I agree, but I'd argue that it should never 
 fail on empty.max.
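 A hedged, standalone illustration of the failure mode (not the listener code): 
 calling .max on an empty Scala collection throws, so the page needs a guarded 
 fallback for jobs whose JobStart was never seen.
 {code}
 val stageIds: Seq[Int] = Seq.empty   // a JobEnd with no matching JobStart
 // stageIds.max                      // java.lang.UnsupportedOperationException: empty.max
 val lastStageId = if (stageIds.isEmpty) -1 else stageIds.max
 {code}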



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5597) Model import/export for DecisionTree and ensembles

2015-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313388#comment-14313388
 ] 

Apache Spark commented on SPARK-5597:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4493

 Model import/export for DecisionTree and ensembles
 --

 Key: SPARK-5597
 URL: https://issues.apache.org/jira/browse/SPARK-5597
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 See parent JIRA for more info.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5558) pySpark zip function unexpected errors

2015-02-09 Thread Charles Hayden (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313434#comment-14313434
 ] 

Charles Hayden edited comment on SPARK-5558 at 2/10/15 2:59 AM:


This seems to be working as expected in 1.3 branch and in master.


was (Author: cchayden):
This seems to be working as expected in 1.3 branch and in main.

 pySpark zip function unexpected errors
 --

 Key: SPARK-5558
 URL: https://issues.apache.org/jira/browse/SPARK-5558
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Charles Hayden
  Labels: pyspark

 Example:
 {quote}
 x = sc.parallelize(range(0,5))
 y = x.map(lambda x: x+1000, preservesPartitioning=True)
 y.take(10)
 x.zip\(y).collect()
 {quote}
 Fails in the JVM: py4J: org.apache.spark.SparkException: 
 Can only zip RDDs with same number of elements in each partition
 If the range is changed to range(0,1000) it fails in pySpark code:
 ValueError: Can not deserialize RDD with different number of items in pair: 
 (100, 1) 
 It also fails if y.take(10) is replaced with y.toDebugString()
 It even fails if we print y._jrdd



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5706) Support inference schema from a single json string

2015-02-09 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-5706:


 Summary: Support inference schema from a single json string
 Key: SPARK-5706
 URL: https://issues.apache.org/jira/browse/SPARK-5706
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao


We have noticed some developers complaining that JSON parsing is very slow, 
particularly schema inference. Some of them suggest that we provide a simple 
interface for inferring the schema from a single complete JSON string record 
instead of sampling.
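
A hedged sketch of today's workaround (the helper and its argument names are 
ours; jsonRDD is the 1.2/1.3-era SQLContext API): infer the schema once from a 
single sample record, then reuse it for the full data set so the large RDD is 
never sampled.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

def loadWithSingleRecordSchema(sc: SparkContext, fullJson: RDD[String], oneRecord: String) = {
  val sqlContext = new SQLContext(sc)
  // Infer the schema from a one-record RDD, then apply it to the big RDD.
  val schema = sqlContext.jsonRDD(sc.parallelize(Seq(oneRecord))).schema
  sqlContext.jsonRDD(fullJson, schema)
}
{code}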



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5682) Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle

2015-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313380#comment-14313380
 ] 

Apache Spark commented on SPARK-5682:
-

User 'kellyzly' has created a pull request for this issue:
https://github.com/apache/spark/pull/4491

 Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle
 --

 Key: SPARK-5682
 URL: https://issues.apache.org/jira/browse/SPARK-5682
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle
Reporter: liyunzhang_intel
 Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx


 Encrypted shuffle is enabled in hadoop 2.6 which make the process of shuffle 
 data safer. This feature is necessary in spark. We reuse hadoop encrypted 
 shuffle feature to spark and because ugi credential info is necessary in 
 encrypted shuffle, we first enable encrypted shuffle on spark-on-yarn 
 framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5682) Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle

2015-02-09 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated SPARK-5682:

Attachment: (was: encrypted_shuffle.patch.4)

 Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle
 --

 Key: SPARK-5682
 URL: https://issues.apache.org/jira/browse/SPARK-5682
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle
Reporter: liyunzhang_intel
 Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx


 Encrypted shuffle is enabled in hadoop 2.6 which make the process of shuffle 
 data safer. This feature is necessary in spark. We reuse hadoop encrypted 
 shuffle feature to spark and because ugi credential info is necessary in 
 encrypted shuffle, we first enable encrypted shuffle on spark-on-yarn 
 framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5705) Explore GPU-accelerated Linear Algebra Libraries

2015-02-09 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313398#comment-14313398
 ] 

Evan Sparks commented on SPARK-5705:


This JIRA is a continuation of this thread: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Using-CUDA-within-Spark-boosting-linear-algebra-td10481.html

To summarise: high-speed linear algebra operations, including but not limited 
to matrix multiplies and solves, have the potential to make certain machine 
learning operations faster on Spark. However, we have to be careful to balance 
the overheads of copying data to, and calling out to, the GPU against other 
factors in the design of the system.

Additionally - getting these libraries compiled, linked, built, and configured 
on a target system is unfortunately not trivial. We should make sure we have a 
standard process for doing this (perhaps starting with this codebase: 
http://github.com/shivaram/matrix-bench).

Maybe we should start with some applications where we think GPU acceleration 
could help? Neural nets are one, LDA is another - others?
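
For concreteness, an illustrative (spark-shell style) sketch of the kind of 
dense multiply under discussion; Breeze delegates to netlib BLAS when native 
libraries are present, and this is exactly the call a GPU BLAS would offload. 
The matrix size is arbitrary.

{code}
import breeze.linalg.DenseMatrix

val n = 2048
val a = DenseMatrix.rand(n, n)
val b = DenseMatrix.rand(n, n)
val t0 = System.nanoTime()
val c = a * b                              // dgemm under the hood
val secs = (System.nanoTime() - t0) / 1e9
println(s"n=$n multiply took $secs s, c(0,0)=${c(0, 0)}")
{code}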

 Explore GPU-accelerated Linear Algebra Libraries
 

 Key: SPARK-5705
 URL: https://issues.apache.org/jira/browse/SPARK-5705
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Evan Sparks
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5558) pySpark zip function unexpected errors

2015-02-09 Thread Charles Hayden (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313434#comment-14313434
 ] 

Charles Hayden commented on SPARK-5558:
---

This seems to be working as expected in 1.3 branch and in main.

 pySpark zip function unexpected errors
 --

 Key: SPARK-5558
 URL: https://issues.apache.org/jira/browse/SPARK-5558
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Charles Hayden
  Labels: pyspark

 Example:
 {quote}
 x = sc.parallelize(range(0,5))
 y = x.map(lambda x: x+1000, preservesPartitioning=True)
 y.take(10)
 x.zip\(y).collect()
 {quote}
 Fails in the JVM: py4J: org.apache.spark.SparkException: 
 Can only zip RDDs with same number of elements in each partition
 If the range is changed to range(0,1000) it fails in pySpark code:
 ValueError: Can not deserialize RDD with different number of items in pair: 
 (100, 1) 
 It also fails if y.take(10) is replaced with y.toDebugString()
 It even fails if we print y._jrdd



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5597) Model import/export for DecisionTree and ensembles

2015-02-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5597.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4493
[https://github.com/apache/spark/pull/4493]

 Model import/export for DecisionTree and ensembles
 --

 Key: SPARK-5597
 URL: https://issues.apache.org/jira/browse/SPARK-5597
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
 Fix For: 1.3.0


 See parent JIRA for more info.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5704) createDataFrame replace applySchema/inferSchema

2015-02-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-5704:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-5166

 createDataFrame replace applySchema/inferSchema
 ---

 Key: SPARK-5704
 URL: https://issues.apache.org/jira/browse/SPARK-5704
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Reporter: Davies Liu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5705) Explore GPU-accelerated Linear Algebra Libraries

2015-02-09 Thread Evan Sparks (JIRA)
Evan Sparks created SPARK-5705:
--

 Summary: Explore GPU-accelerated Linear Algebra Libraries
 Key: SPARK-5705
 URL: https://issues.apache.org/jira/browse/SPARK-5705
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Evan Sparks
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2653) Heap size should be the sum of driver.memory and executor.memory in local mode

2015-02-09 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313680#comment-14313680
 ] 

Davies Liu commented on SPARK-2653:
---

Right now, in local mode, only one of spark.driver.memory or 
spark.executor.memory is used as the JVM heap size (depending on the order of 
the command-line arguments). I think it should be the sum of them.
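
A REPL-style sketch of the proposed sizing rule only (the helper is ours, not 
Spark internals):

{code}
// Proposal: size the single local-mode JVM at driver + executor memory,
// rather than whichever of the two settings happens to win today.
def localModeHeapMb(driverMemoryMb: Int, executorMemoryMb: Int): Int =
  driverMemoryMb + executorMemoryMb

// e.g. spark.driver.memory=2048m plus spark.executor.memory=4096m -> 6144m heap
assert(localModeHeapMb(2048, 4096) == 6144)
{code}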

 Heap size should be the sum of driver.memory and executor.memory in local mode
 --

 Key: SPARK-2653
 URL: https://issues.apache.org/jira/browse/SPARK-2653
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Davies Liu
Priority: Minor
   Original Estimate: 1h
  Remaining Estimate: 1h

 In local mode, the driver and executor run in the same JVM, so the heap size 
 of JVM should be the sum of spark.driver.memory and spark.executor.memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5709) Add EXPLAIN support for DataFrame API for debugging purpose

2015-02-09 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-5709:


 Summary: Add EXPLAIN support for DataFrame API for debugging 
purpose
 Key: SPARK-5709
 URL: https://issues.apache.org/jira/browse/SPARK-5709
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5702) Allow short names for built-in data sources

2015-02-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5702:
--

 Summary: Allow short names for built-in data sources
 Key: SPARK-5702
 URL: https://issues.apache.org/jira/browse/SPARK-5702
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


e.g. json, parquet, jdbc.
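
A hedged sketch of the idea (the mapping targets below are illustrative 
placeholders, not necessarily the actual provider packages): resolve a short 
alias first, and fall back to treating the string as a fully qualified data 
source name.

{code}
val builtInSources = Map(
  "json"    -> "org.apache.spark.sql.json",
  "parquet" -> "org.apache.spark.sql.parquet",
  "jdbc"    -> "org.apache.spark.sql.jdbc"
)

def resolveDataSource(name: String): String =
  builtInSources.getOrElse(name.toLowerCase, name)
{code}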



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5703) AllJobsPage throws empty.max error

2015-02-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5703:
-
Summary: AllJobsPage throws empty.max error  (was: JobProgressListener 
throws empty.max error)

 AllJobsPage throws empty.max error
 --

 Key: SPARK-5703
 URL: https://issues.apache.org/jira/browse/SPARK-5703
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical

 In JobProgressListener, if you have a JobEnd that does not have a 
 corresponding JobStart, AND you render the AllJobsPage, then you'll run into 
 the empty.max exception.
 I ran into this when trying to replay parts of an event log after trimming a 
 few events. Not a common use case I agree, but I'd argue that it should never 
 fail on empty.max.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2653) Heap size should be the sum of driver.memory and executor.memory in local mode

2015-02-09 Thread liu chang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313663#comment-14313663
 ] 

liu chang commented on SPARK-2653:
--

Hi, Davies, what's wrong with this?

 Heap size should be the sum of driver.memory and executor.memory in local mode
 --

 Key: SPARK-2653
 URL: https://issues.apache.org/jira/browse/SPARK-2653
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Davies Liu
Priority: Minor
   Original Estimate: 1h
  Remaining Estimate: 1h

 In local mode, the driver and executor run in the same JVM, so the heap size 
 of JVM should be the sum of spark.driver.memory and spark.executor.memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5710) Combines two adjacent `Cast` expressions into one

2015-02-09 Thread guowei (JIRA)
guowei created SPARK-5710:
-

 Summary: Combines two adjacent `Cast` expressions into one
 Key: SPARK-5710
 URL: https://issues.apache.org/jira/browse/SPARK-5710
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: guowei


A plan produced by the `analyzer` with `typeCoercionRules` may contain many 
`cast` expressions. We can combine the adjacent ones.

For example:
create table test(a decimal(3,1));
explain select * from test where a*2-1>1;

== Physical Plan ==
Filter (CAST(CAST((CAST(CAST((CAST(a#5, DecimalType()) * 2), 
DecimalType(21,1)), DecimalType()) - 1), DecimalType(22,1)), DecimalType()) > 1)
 HiveTableScan [a#5], (MetastoreRelation default, test, None), None
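
A hedged sketch of what such an optimizer rule could look like (the rule name 
is ours and the safety analysis is deliberately elided; dropping an inner cast 
is only legal when it cannot change the result):

{code}
import org.apache.spark.sql.catalyst.expressions.Cast
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object CombineAdjacentCasts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // Keep only the outermost cast of a Cast(Cast(...)) chain.
    case Cast(Cast(child, _), dataType) => Cast(child, dataType)
  }
}
{code}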




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5570) No docs stating that `new SparkConf().set(spark.driver.memory, ...) will not work

2015-02-09 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312416#comment-14312416
 ] 

Ilya Ganelin edited comment on SPARK-5570 at 2/9/15 4:27 PM:
-

I would be happy to fix this. Thank you. 


was (Author: ilganeli):
I'll fix this, can you please assign it to me? Thanks.

 No docs stating that `new SparkConf().set(spark.driver.memory, ...) will 
 not work
 ---

 Key: SPARK-5570
 URL: https://issues.apache.org/jira/browse/SPARK-5570
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Spark Core
Affects Versions: 1.2.0
Reporter: Tathagata Das
Assignee: Andrew Or





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends

2015-02-09 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312382#comment-14312382
 ] 

Ilya Ganelin commented on SPARK-823:


Hi [~joshrosen], I believe the documentation is up to date. I reviewed all 
usages of spark.default.parallelism and found no inconsistencies with the 
documentation. The only thing that is undocumented about 
spark.default.parallelism is how it's used within the Partitioner class in 
both Spark and Python: if it is defined, the default number of partitions 
created equals spark.default.parallelism; otherwise, it falls back to the 
number of partitions of the RDDs involved. I think this issue can be closed - 
I don't think that particular case needs to be publicly documented (it's 
clearly evident in the code what is going on). 
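
A hedged paraphrase of that behaviour in code (the helper is ours, not the 
Partitioner source):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def defaultNumPartitions(sc: SparkContext, rdds: Seq[RDD[_]]): Int =
  if (sc.getConf.contains("spark.default.parallelism")) sc.defaultParallelism
  else rdds.map(_.partitions.length).max   // assumes rdds is non-empty
{code}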

 spark.default.parallelism's default is inconsistent across scheduler backends
 -

 Key: SPARK-823
 URL: https://issues.apache.org/jira/browse/SPARK-823
 Project: Spark
  Issue Type: Bug
  Components: Documentation, PySpark, Scheduler
Affects Versions: 0.8.0, 0.7.3, 0.9.1
Reporter: Josh Rosen
Priority: Minor

 The [0.7.3 configuration 
 guide|http://spark-project.org/docs/latest/configuration.html] says that 
 {{spark.default.parallelism}}'s default is 8, but the default is actually 
 max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos 
 scheduler, and {{threads}} for the local scheduler:
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150
 Should this be clarified in the documentation?  Should the Mesos scheduler 
 backend's default be revised?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4600) org.apache.spark.graphx.VertexRDD.diff does not work

2015-02-09 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312589#comment-14312589
 ] 

Brennon York commented on SPARK-4600:
-

I can take this, thanks!

 org.apache.spark.graphx.VertexRDD.diff does not work
 

 Key: SPARK-4600
 URL: https://issues.apache.org/jira/browse/SPARK-4600
 Project: Spark
  Issue Type: Bug
  Components: GraphX
 Environment: scala 2.10.4
 spark 1.1.0
Reporter: Teppei Tosa
  Labels: graphx

 VertexRDD.diff doesn't work.
 For example:
 val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 2L).map(id => (id, id.toInt)))
 setA.collect.foreach(println(_))
 // (0,0)
 // (1,1)
 val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(1L until 3L).map(id => (id, id.toInt)))
 setB.collect.foreach(println(_))
 // (1,1)
 // (2,2)
 val diff = setA.diff(setB)
 diff.collect.foreach(println(_))
 // printed none



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1142) Allow adding jars on app submission, outside of code

2015-02-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1142.

Resolution: Not a Problem

 Allow adding jars on app submission, outside of code
 

 Key: SPARK-1142
 URL: https://issues.apache.org/jira/browse/SPARK-1142
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 0.9.0
Reporter: Sandy Pérez González
Assignee: Sandy Ryza

 yarn-standalone mode supports an option that allows adding jars that will be 
 distributed on the cluster with job submission.  Providing similar 
 functionality for other app submission modes will allow the spark-app script 
 proposed in SPARK-1126 to support an add-jars option that works for every 
 submit mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5691) Preventing duplicate registering of an application has incorrect logic

2015-02-09 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-5691:
-

 Summary: Preventing duplicate registering of an application has 
incorrect logic
 Key: SPARK-5691
 URL: https://issues.apache.org/jira/browse/SPARK-5691
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0, 1.1.1
Reporter: Matt Cheah
 Fix For: 1.3.0


If an application registers twice with the Master, the Master accepts both 
copies and they both show up in the UI and consume resources. This is incorrect 
behavior.

This happens inadvertently in regular usage when the Master is under high load. 
It boils down to this: an application times out registering with the Master and 
sends a second registration message while the Master is still alive; the Master 
processes the first registration message for the application but then 
erroneously processes the second one as well instead of discarding it.
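
A hedged, self-contained sketch of the desired behaviour (the type and names 
are ours, not the Master's internals): a repeated registration for an id that 
is already tracked should be dropped rather than creating a second entry.

{code}
import scala.collection.mutable

class RegistrationBook {
  private val registered = mutable.Set.empty[String]

  /** Returns true only for the first registration of a given application id. */
  def tryRegister(appId: String): Boolean = synchronized {
    if (registered.contains(appId)) false
    else { registered += appId; true }
  }
}
{code}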



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5692) Model import/export for Word2Vec

2015-02-09 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5692:


 Summary: Model import/export for Word2Vec
 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng


Support save and load for Word2VecModel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5692) Model import/export for Word2Vec

2015-02-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5692:
-
Description: Support save and load for Word2VecModel. We may want to 
discuss whether we want to be compatible with the original Word2Vec model 
storage format.  (was: Support save and load for Word2VecModel.)

 Model import/export for Word2Vec
 

 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng

 Support save and load for Word2VecModel. We may want to discuss whether we 
 want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5693) Install Pandas on Jenkins machines and enable to_pandas doctest for DataFrames

2015-02-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5693:
--

 Summary: Install Pandas on Jenkins machines and enable to_pandas 
doctest for DataFrames
 Key: SPARK-5693
 URL: https://issues.apache.org/jira/browse/SPARK-5693
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Reporter: Reynold Xin
Assignee: Patrick Wendell


DataFrame.to_pandas doctests are disabled as Jenkins machines don't have Pandas 
installed. 

[~pwendell] I assigned this to you but feel free to delegate. Thanks.

cc [~davies]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5694) Python API for evaluation metrics

2015-02-09 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5694:


 Summary: Python API for evaluation metrics
 Key: SPARK-5694
 URL: https://issues.apache.org/jira/browse/SPARK-5694
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng


Add Python API for evaluation metrics defined under `mllib.evaluation`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4900) MLlib SingularValueDecomposition ARPACK IllegalStateException

2015-02-09 Thread Mike Beyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312769#comment-14312769
 ] 

Mike Beyer commented on SPARK-4900:
---

I put snapshot test data (a 1000x1000 matrix) at 
https://dl.dropboxusercontent.com/u/8489998/test_matrix_1.zip

Calling:
String filename = "/custompath/27637/test_matrix_1";
RDD<Vector> vectorRDD = MLUtils.loadVectors(javaSparkContext.sc(), filename);
vectorRDD.cache();
System.out.println("trtRowRDD.count():\t" + vectorRDD.count());
RowMatrix rowMatrix = new RowMatrix(vectorRDD);
System.out.println("rowMatrix.numRows():\t" + rowMatrix.numRows());
System.out.println("rowMatrix.numCols():\t" + rowMatrix.numCols());
{
    int k = 10;
    boolean computeU = true;
    double rCond = 1.0E-9d;
    SingularValueDecomposition<RowMatrix, Matrix> svd = rowMatrix.computeSVD(k, computeU, rCond);
    RowMatrix u = svd.U();
    RDD<Vector> uRowsRDD = u.rows();
    System.out.println("uRowsRDD.count():\t" + uRowsRDD.count());
    Vector s = svd.s();
    System.out.println("s.size():\t" + s.size());
    Matrix v = svd.V();
    System.out.println("v.numRows():\t" + v.numRows());
    System.out.println("v.numCols():\t" + v.numCols());
}

results in:


maxFeatureSpaceTermNumber:  1000
trtRowRDD.count():  1000
rowMatrix.numRows():1000
rowMatrix.numCols():1000
15/02/09 19:56:59 WARN PrimaryRunnerSpark:
java.lang.IllegalStateException: ARPACK returns non-zero info = 3 Please refer 
ARPACK user guide for error message.
at 
org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:120)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:258)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:190)
at 
com.example.processing.spark.SVDProcessing2.createSVD_2(SVDProcessing2.java:184)
at 
com.example.processing.spark.RunnerSpark.main(PrimaryRunnerSpark.java:27)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at sbt.Run.invokeMain(Run.scala:67)
at sbt.Run.run0(Run.scala:61)
at sbt.Run.sbt$Run$$execute$1(Run.scala:51)
at sbt.Run$$anonfun$run$1.apply$mcV$sp(Run.scala:55)
at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
at sbt.Logger$$anon$4.apply(Logger.scala:85)
at sbt.TrapExit$App.run(TrapExit.scala:248)
at java.lang.Thread.run(Thread.java:745)
15/02/09 19:56:59 INFO TimeCounter: TIMER 
[com.example.processing.spark.PrimaryRunnerSpark] : 13.0 Seconds
TIMER [com.example.processing.spark.PrimaryRunnerSpark] : 13.0 Seconds
15/02/09 19:56:59 ERROR ContextCleaner: Error cleaning broadcast 20
java.lang.InterruptedException
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at 
org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137)
at 
org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
at 
org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)
at 
org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:185)
at 

[jira] [Updated] (SPARK-5195) when hive table is query with alias the cache data lose effectiveness.

2015-02-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5195:
---
Assignee: yixiaohua

 when hive table is query with alias  the cache data  lose effectiveness.
 

 Key: SPARK-5195
 URL: https://issues.apache.org/jira/browse/SPARK-5195
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: yixiaohua
Assignee: yixiaohua
 Fix For: 1.3.0


 Override MetastoreRelation's sameResult method to compare only the database 
 name and table name, because previously:
 cache table t1;
 select count(*) from t1;
 reads data from memory, but the query below does not; instead it reads from HDFS:
 select count(*) from t1 t;
 Cached data is keyed by logical plan and compared with sameResult, so when a 
 table is queried with an alias its logical plan is not the same as the plan 
 without the alias. Hence sameResult is modified to compare only the database 
 name and table name.
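
 A hedged, self-contained sketch of the comparison described above (our own 
 types, not the actual MetastoreRelation patch): the cache lookup ignores the 
 alias and compares only database and table names.
 {code}
 final case class HiveTableKey(database: String, table: String)

 def sameCachedTable(a: HiveTableKey, b: HiveTableKey): Boolean =
   a.database.equalsIgnoreCase(b.database) && a.table.equalsIgnoreCase(b.table)
 {code}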



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5343) ShortestPaths traverses backwards

2015-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312651#comment-14312651
 ] 

Apache Spark commented on SPARK-5343:
-

User 'brennonyork' has created a pull request for this issue:
https://github.com/apache/spark/pull/4478

 ShortestPaths traverses backwards
 -

 Key: SPARK-5343
 URL: https://issues.apache.org/jira/browse/SPARK-5343
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.2.0
Reporter: Michael Malak

 GraphX ShortestPaths seems to be following edges backwards instead of 
 forwards:
 import org.apache.spark.graphx._
 val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), 
   sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,""))))
 lib.ShortestPaths.run(g,Array(3)).vertices.collect
 res1: Array[(org.apache.spark.graphx.VertexId, 
 org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3 -> 0)), (2,Map()))
 lib.ShortestPaths.run(g,Array(1)).vertices.collect
 res2: Array[(org.apache.spark.graphx.VertexId, 
 org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), (3,Map(1 -> 2)), (2,Map(1 -> 1)))
 The following changes may be what will make it run forward:
 Change one occurrence of src to dst in
 https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L64
 Change three occurrences of dst to src in
 https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L65



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5679) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method

2015-02-09 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5679:
--
Labels: flaky-test  (was: )

 Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads 
 and input metrics with mixed read method 
 --

 Key: SPARK-5679
 URL: https://issues.apache.org/jira/browse/SPARK-5679
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Kostas Sakellis
Priority: Blocker
  Labels: flaky-test

 Please audit these and see if there are any assumptions with respect to File 
 IO that might not hold in all cases. I'm happy to help if you can't find 
 anything.
 These both failed in the same run:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-SBT/38/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink
 {code}
 org.apache.spark.metrics.InputOutputMetricsSuite.input metrics with mixed 
 read method
 Failing for the past 13 builds (Since Failed#26 )
 Took 48 sec.
 Error Message
 2030 did not equal 6496
 Stacktrace
 sbt.ForkMain$ForkError: 2030 did not equal 6496
   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
   at 
 org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply$mcV$sp(InputOutputMetricsSuite.scala:135)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfter$$super$runTest(InputOutputMetricsSuite.scala:46)
   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite.runTest(InputOutputMetricsSuite.scala:46)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfterAll$$super$run(InputOutputMetricsSuite.scala:46)
   at 
 org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
   at 
 org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
   at 
 

[jira] [Resolved] (SPARK-5664) Restore stty settings when exiting for launching spark-shell from SBT

2015-02-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5664.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4451
[https://github.com/apache/spark/pull/4451]

 Restore stty settings when exiting for launching spark-shell from SBT
 -

 Key: SPARK-5664
 URL: https://issues.apache.org/jira/browse/SPARK-5664
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Liang-Chi Hsieh
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3242) Spark 1.0.2 ec2 scripts creates clusters with Spark 1.0.1 installed by default

2015-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3242.
--
Resolution: Fixed

Resolved by 
https://github.com/apache/spark/commit/444ccdd80ec5df249978d8498b4fc501cc3429d7

 Spark 1.0.2 ec2 scripts creates clusters with Spark 1.0.1 installed by default
 --

 Key: SPARK-3242
 URL: https://issues.apache.org/jira/browse/SPARK-3242
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.0.2
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5311) EventLoggingListener throws exception if log directory does not exist

2015-02-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5311.

Resolution: Won't Fix
  Assignee: Josh Rosen

 EventLoggingListener throws exception if log directory does not exist
 -

 Key: SPARK-5311
 URL: https://issues.apache.org/jira/browse/SPARK-5311
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker

 If the log directory does not exist, EventLoggingListener throws an 
 IllegalArgumentException.  Here's a simple reproduction (using the master 
 branch (1.3.0)):
 {code}
 ./bin/spark-shell --conf spark.eventLog.enabled=true --conf 
 spark.eventLog.dir=/tmp/nonexistent-dir
 {code}
 where /tmp/nonexistent-dir is a directory that doesn't exist and /tmp exists. 
  This results in the following exception:
 {code}
 15/01/18 17:10:44 INFO HttpServer: Starting HTTP Server
 15/01/18 17:10:44 INFO Utils: Successfully started service 'HTTP file server' 
 on port 62729.
 15/01/18 17:10:44 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
 Attempting port 4041.
 15/01/18 17:10:44 INFO Utils: Successfully started service 'SparkUI' on port 
 4041.
 15/01/18 17:10:44 INFO SparkUI: Started SparkUI at 
 http://joshs-mbp.att.net:4041
 15/01/18 17:10:45 INFO Executor: Using REPL class URI: 
 http://192.168.1.248:62726
 15/01/18 17:10:45 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
 akka.tcp://sparkdri...@joshs-mbp.att.net:62728/user/HeartbeatReceiver
 15/01/18 17:10:45 INFO NettyBlockTransferService: Server created on 62730
 15/01/18 17:10:45 INFO BlockManagerMaster: Trying to register BlockManager
 15/01/18 17:10:45 INFO BlockManagerMasterActor: Registering block manager 
 localhost:62730 with 265.4 MB RAM, BlockManagerId(driver, localhost, 62730)
 15/01/18 17:10:45 INFO BlockManagerMaster: Registered BlockManager
 java.lang.IllegalArgumentException: Log directory /tmp/nonexistent-dir does 
 not exist.
   at 
 org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:90)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:363)
   at 
 org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:986)
   at $iwC$$iwC.<init>(<console>:9)
   at $iwC.<init>(<console>:18)
   at <init>(<console>:20)
   at .<init>(<console>:24)
   at .<clinit>(<console>)
   at .<init>(<console>:7)
   at .<clinit>(<console>)
   at $print(<console>)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
   at 
 org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:123)
   at 
 org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
   at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:270)
   at 
 org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
   at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:60)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:945)
   at 
 org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:147)
   at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:60)
   at 
 org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
   at 
 org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:60)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:962)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
   at 
 

[jira] [Commented] (SPARK-5589) Split pyspark/sql.py into multiple files

2015-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312781#comment-14312781
 ] 

Apache Spark commented on SPARK-5589:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4479

 Split pyspark/sql.py into multiple files
 

 Key: SPARK-5589
 URL: https://issues.apache.org/jira/browse/SPARK-5589
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu

 sql.py has more than 2k LOC; it should be split into multiple modules.
 Also, put all data types into pyspark.sql.types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5695) Check GBT caching logic

2015-02-09 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5695:


 Summary: Check GBT caching logic
 Key: SPARK-5695
 URL: https://issues.apache.org/jira/browse/SPARK-5695
 Project: Spark
  Issue Type: Task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


A user reported that `t(n) = t(n-1) + const`, which may be caused by cached 
RDDs being evicted during training. We want to check whether there is room to 
improve the caching logic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4900) MLlib SingularValueDecomposition ARPACK IllegalStateException

2015-02-09 Thread Mike Beyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Beyer updated SPARK-4900:
--
Affects Version/s: 1.2.1

 MLlib SingularValueDecomposition ARPACK IllegalStateException 
 --

 Key: SPARK-4900
 URL: https://issues.apache.org/jira/browse/SPARK-4900
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.1, 1.2.0, 1.2.1
 Environment: Ubuntu 1410, Java HotSpot(TM) 64-Bit Server VM (build 
 25.25-b02, mixed mode)
 spark local mode
Reporter: Mike Beyer

 java.lang.reflect.InvocationTargetException
 ...
 Caused by: java.lang.IllegalStateException: ARPACK returns non-zero info = 3 
 Please refer ARPACK user guide for error message.
 at 
 org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:120)
 at 
 org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:235)
 at 
 org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:171)
   ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4900) MLlib SingularValueDecomposition ARPACK IllegalStateException

2015-02-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312795#comment-14312795
 ] 

Sean Owen commented on SPARK-4900:
--

So I think there is at least a small problem in the error reporting:

{code}
  info.`val` match {
    case 1 => throw new IllegalStateException("ARPACK returns non-zero info = " + info.`val` +
      " Maximum number of iterations taken. (Refer ARPACK user guide for details)")
    case 2 => throw new IllegalStateException("ARPACK returns non-zero info = " + info.`val` +
      " No shifts could be applied. Try to increase NCV. " +
      "(Refer ARPACK user guide for details)")
    case _ => throw new IllegalStateException("ARPACK returns non-zero info = " + info.`val` +
      " Please refer ARPACK user guide for error message.")
  }
{code}

Really, what's called case 2 here corresponds to return value 3, which is what 
you get.

{code}
=  0: Normal exit.
=  1: Maximum number of iterations taken.
  All possible eigenvalues of OP has been found. IPARAM(5)  
  returns the number of wanted converged Ritz values.
=  2: No longer an informational error. Deprecated starting
  with release 2 of ARPACK.
=  3: No shifts could be applied during a cycle of the 
  Implicitly restarted Arnoldi iteration. One possibility 
  is to increase the size of NCV relative to NEV. 
  See remark 4 below.
{code}

I can fix the error message. Remark 4 that it refers to is:

{code}
4. At present there is no a-priori analysis to guide the selection
   of NCV relative to NEV.  The only formal requirement is that NCV > NEV.
   However, it is recommended that NCV .ge. 2*NEV.  If many problems of
   the same type are to be solved, one should experiment with increasing
   NCV while keeping NEV fixed for a given test problem.  This will 
   usually decrease the required number of OP*x operations but it
   also increases the work and storage required to maintain the orthogonal
   basis vectors.   The optimal cross-over with respect to CPU time
   is problem dependent and must be determined empirically.
{code}

So I think that translates to "k is too big". Is the matrix low-rank?
In any event this is ultimately Breeze code, and I'm not sure there's much 
that will be done in Spark itself.
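
A hedged illustration of that remark on the caller's side (the helper is ours, 
not Spark API): keep k well below the column count so ARPACK can pick 
NCV >= 2*k, since it requires NCV > NEV and NCV cannot exceed the problem size.

{code}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def conservativeSvd(m: RowMatrix, requestedK: Int) = {
  val k = math.max(1, math.min(requestedK, m.numCols().toInt / 2))
  m.computeSVD(k, computeU = true, rCond = 1e-9)
}
{code}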

 MLlib SingularValueDecomposition ARPACK IllegalStateException 
 --

 Key: SPARK-4900
 URL: https://issues.apache.org/jira/browse/SPARK-4900
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.1, 1.2.0, 1.2.1
 Environment: Ubuntu 1410, Java HotSpot(TM) 64-Bit Server VM (build 
 25.25-b02, mixed mode)
 spark local mode
Reporter: Mike Beyer

 java.lang.reflect.InvocationTargetException
 ...
 Caused by: java.lang.IllegalStateException: ARPACK returns non-zero info = 3 
 Please refer ARPACK user guide for error message.
 at 
 org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:120)
 at 
 org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:235)
 at 
 org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:171)
   ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5683) Improve the json serialization for DataFrame API

2015-02-09 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-5683:


 Summary: Improve the json serialization for DataFrame API
 Key: SPARK-5683
 URL: https://issues.apache.org/jira/browse/SPARK-5683
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5683) Improve the json serialization for DataFrame API

2015-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311923#comment-14311923
 ] 

Apache Spark commented on SPARK-5683:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/4468

 Improve the json serialization for DataFrame API
 

 Key: SPARK-5683
 URL: https://issues.apache.org/jira/browse/SPARK-5683
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5684) Key not found exception is thrown in case location of added partition to a parquet table is different than a path containing the partition values

2015-02-09 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-5684:
--
Priority: Major  (was: Critical)

 Key not found exception is thrown in case location of added partition to a 
 parquet table is different than a path containing the partition values
 -

 Key: SPARK-5684
 URL: https://issues.apache.org/jira/browse/SPARK-5684
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.1.1, 1.2.0
Reporter: Yash Datta
 Fix For: 1.3.0


 Create a partitioned parquet table : 
 create table test_table (dummy string) partitioned by (timestamp bigint) 
 stored as parquet;
 Add a partition to the table and specify a different location:
 alter table test_table add partition (timestamp=9) location 
 '/data/pth/different'
 Run a simple select * query and we get an exception:
 15/02/09 08:27:25 ERROR thriftserver.SparkSQLDriver: Failed in [select * from 
 db4_mi2mi_binsrc1_default limit 5]
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 
 (TID 21, localhost): java
 .util.NoSuchElementException: key not found: timestamp
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.MapLike$class.apply(MapLike.scala:141)
 at scala.collection.AbstractMap.apply(Map.scala:58)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:141)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:128)
 at 
 org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:247)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 This happens because in parquet path it is assumed that (key=value) patterns 
 are present in the partition location, which is not always the case!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5685) Show warning when users open text files compressed with non-splittable algorithms like gzip

2015-02-09 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-5685:
---

 Summary: Show warning when users open text files compressed with 
non-splittable algorithms like gzip
 Key: SPARK-5685
 URL: https://issues.apache.org/jira/browse/SPARK-5685
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Nicholas Chammas
Priority: Minor


This is a usability or user-friendliness issue.

It's extremely common for people to load a text file compressed with gzip, 
process it, and then wonder why only 1 core in their cluster is doing any work.

Some examples:
* http://stackoverflow.com/q/28127119/877069
* http://stackoverflow.com/q/27531816/877069

I'm not sure how this problem can be generalized, but at the very least it 
would be helpful if Spark displayed some kind of warning in the common case 
when someone opens a gzipped file with {{sc.textFile}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5710) Combines two adjacent `Cast` expressions into one

2015-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313729#comment-14313729
 ] 

Apache Spark commented on SPARK-5710:
-

User 'guowei2' has created a pull request for this issue:
https://github.com/apache/spark/pull/4497

 Combines two adjacent `Cast` expressions into one
 -

 Key: SPARK-5710
 URL: https://issues.apache.org/jira/browse/SPARK-5710
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: guowei

 A plan after `analyzer` with `typeCoercionRules` may produce many `cast` 
 expressions. We can combine the adjacent ones.
 For example: 
 create table test(a decimal(3,1));
 explain select * from test where a*2-1>1;
 == Physical Plan ==
 Filter (CAST(CAST((CAST(CAST((CAST(a#5, DecimalType()) * 2), 
 DecimalType(21,1)), DecimalType()) - 1), DecimalType(22,1)), DecimalType()) > 
 1)
  HiveTableScan [a#5], (MetastoreRelation default, test, None), None
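
For illustration only (this is not the content of the linked pull request), a minimal Catalyst 
rule that collapses such nested casts could look like the sketch below; a real rule must also 
verify that dropping the inner cast cannot change the result (e.g. for precision-changing 
decimal casts).

{code}
import org.apache.spark.sql.catalyst.expressions.Cast
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical optimizer rule: Cast(Cast(e, t1), t2) is rewritten to Cast(e, t2).
// This ignores the cases where the intermediate cast is semantically significant.
object CombineAdjacentCasts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Cast(Cast(child, _), outerType) => Cast(child, outerType)
  }
}
{code}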



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5711) Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-09 Thread Sun Fulin (JIRA)
Sun Fulin created SPARK-5711:


 Summary: Sort Shuffle performance issues about using AppendOnlyMap 
for large data sets
 Key: SPARK-5711
 URL: https://issues.apache.org/jira/browse/SPARK-5711
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: hbase-0.98.6-cdh5.2.0 phoenix-4.2.2
Reporter: Sun Fulin


Recently we hit performance issues when using Spark 1.2.0 to read data from 
HBase and do some summary work.
Our scenario is to read large data sets from HBase (100G+ files), form an HBase 
RDD, transform it to a SchemaRDD, then group by and aggregate the data into a 
much smaller summary data set, and load that back into HBase (Phoenix).

Our major issue: aggregating the large data sets into the summary data sets 
takes far too long (1 hour+), which is much worse performance than we would 
expect. We have attached the dump file, and jstack gives stack traces like the 
following:

From the stack traces and the dump file we can see that processing large data 
sets causes the AppendOnlyMap to grow frequently, leading to a huge map entry 
size. We looked at the source of 
org.apache.spark.util.collection.AppendOnlyMap and found that the map is 
initialized with a capacity of 64, which is too small for our use case. 

Thread 22432: (state = IN_JAVA)
- org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, line=224 
(Compiled frame; information may be imprecise)
- org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() 
@bci=1, line=38 (Interpreted frame)
- org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, 
line=198 (Compiled frame)
- org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, 
scala.Function2) @bci=201, line=145 (Compiled frame)
- 
org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object,
 scala.Function2) @bci=3, line=32 (Compiled frame)
- 
org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator)
 @bci=141, line=205 (Compiled frame)
- 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator)
 @bci=74, line=58 (Interpreted frame)
- 
org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) 
@bci=169, line=68 (Interpreted frame)
- 
org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) 
@bci=2, line=41 (Interpreted frame)
- org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted frame)
- org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196 
(Interpreted frame)
- 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
 @bci=95, line=1145 (Interpreted frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 
(Interpreted frame)
- java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)


Thread 22431: (state = IN_JAVA)
- org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, line=224 
(Compiled frame; information may be imprecise)
- org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() 
@bci=1, line=38 (Interpreted frame)
- org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, 
line=198 (Compiled frame)
- org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, 
scala.Function2) @bci=201, line=145 (Compiled frame)
- 
org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object,
 scala.Function2) @bci=3, line=32 (Compiled frame)
- 
org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator)
 @bci=141, line=205 (Compiled frame)
- 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator)
 @bci=74, line=58 (Interpreted frame)
- 
org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) 
@bci=169, line=68 (Interpreted frame)
- 
org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) 
@bci=2, line=41 (Interpreted frame)
- org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted frame)
- org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196 
(Interpreted frame)
- 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
 @bci=95, line=1145 (Interpreted frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 
(Interpreted frame)
- java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)
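
For intuition about why growTable dominates the dumps above, here is a small, simplified 
illustration (this is not Spark's code; the 0.7 load factor is an assumption):

{code}
// A hash map that starts at capacity 64 and doubles on each growth will call growTable
// roughly log2(n / 64) times while inserting n distinct keys, rehashing the whole table
// each time, which is where the time in the dumps above is being spent.
def growthSteps(nKeys: Long, initialCapacity: Int = 64, loadFactor: Double = 0.7): Int = {
  var capacity = initialCapacity.toLong
  var steps = 0
  while (nKeys > capacity * loadFactor) {
    capacity *= 2
    steps += 1
  }
  steps
}

// Example: growthSteps(100000000L) == 22, i.e. 22 rehashes of an ever-larger table.
{code}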




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5700) Bump jets3t version from 0.9.2 to 0.9.3 in hadoop-2.3 and hadoop-2.4 profiles

2015-02-09 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313746#comment-14313746
 ] 

Josh Rosen commented on SPARK-5700:
---

Looks like 0.9.3 is now on Maven Central: 
http://search.maven.org/#artifactdetails%7Cnet.java.dev.jets3t%7Cjets3t%7C0.9.3%7Cjar

 Bump jets3t version from 0.9.2 to 0.9.3 in hadoop-2.3 and hadoop-2.4 profiles
 -

 Key: SPARK-5700
 URL: https://issues.apache.org/jira/browse/SPARK-5700
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian
  Labels: flaky-test

 This is a follow-up ticket for SPARK-5671 and SPARK-5696.
 JetS3t 0.9.2 contains a log4j.properties file inside the artifact and breaks 
 our tests (see SPARK-5696). This is fixed in 0.9.3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5704) createDataFrame replace applySchema/inferSchema

2015-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313763#comment-14313763
 ] 

Apache Spark commented on SPARK-5704:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4498

 createDataFrame replace applySchema/inferSchema
 ---

 Key: SPARK-5704
 URL: https://issues.apache.org/jira/browse/SPARK-5704
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Reporter: Davies Liu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4655) Split Stage into ShuffleMapStage and ResultStage subclasses

2015-02-09 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4655:
--
Target Version/s: 1.4.0  (was: 1.3.0)
Assignee: Ilya Ganelin  (was: Josh Rosen)

Hi [~ilganeli], feel free to work on this; I've assigned this JIRA to you.

 Split Stage into ShuffleMapStage and ResultStage subclasses
 ---

 Key: SPARK-4655
 URL: https://issues.apache.org/jira/browse/SPARK-4655
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Ilya Ganelin

 The scheduler's {{Stage}} class has many fields which are only applicable to 
 result stages or shuffle map stages.  As a result, I think that it makes 
 sense to make {{Stage}} into an abstract base class with two subclasses, 
 {{ResultStage}} and {{ShuffleMapStage}}.  This would improve the 
 understandability of the DAGScheduler code. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5678) DataFrame.to_pandas

2015-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312568#comment-14312568
 ] 

Apache Spark commented on SPARK-5678:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4476

 DataFrame.to_pandas 
 

 Key: SPARK-5678
 URL: https://issues.apache.org/jira/browse/SPARK-5678
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu

 to_pandas to convert a DataFrame into a Pandas DataFrame. Note that the whole 
 DataFrame API should still work even when Pandas is not available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later

2015-02-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4267.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Sean Owen

 Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
 --

 Key: SPARK-4267
 URL: https://issues.apache.org/jira/browse/SPARK-4267
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Tsuyoshi OZAWA
Assignee: Sean Owen
Priority: Blocker
 Fix For: 1.3.0


 Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 
 uses protobuf 2.5.0 so I compiled with protobuf 2.5.1 like this:
 {code}
  ./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn 
 -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0
 {code}
 Then Spark on YARN fails to launch jobs with NPE.
 {code}
 $ bin/spark-shell --master yarn-client
 scala> sc.textFile("hdfs:///user/ozawa/wordcountInput20G").flatMap(line 
 => line.split(" ")).map(word => (word, 1)).persist().reduceByKey((a, b) => a 
 + b, 16).saveAsTextFile("hdfs:///user/ozawa/sparkWordcountOutNew2");
 java.lang.NullPointerException
 at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284)
 at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291)
 at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480)
 at $iwC$$iwC$$iwC$$iwC.<init>(<console>:13)
 at $iwC$$iwC$$iwC.<init>(<console>:18)
 at $iwC$$iwC.<init>(<console>:20)
 at $iwC.<init>(<console>:22)
 at <init>(<console>:24)
 at .<init>(<console>:28)
 at .<clinit>(<console>)
 at .<init>(<console>:7)
 at .<clinit>(<console>)
 at $print(<console>)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
 at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
 at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)

[jira] [Commented] (SPARK-5679) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method

2015-02-09 Thread Kostas Sakellis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312600#comment-14312600
 ] 

Kostas Sakellis commented on SPARK-5679:


I have tried to repro this in a number of different ways and failed:

1. On mac with sbt and without sbt.
2. On centos 6x using sbt and without sbt
3. On ubuntu using sbt and without sbt

Yet it is reproducible on the build machines but only for hadoop 2.2. As 
[~joshrosen] pointed out, it might be some shared state that is specific to 
older versions of hadoop. How do we feel about changing this test suite so that 
it doesn't use a shared SparkContext? I know it will slow down the tests a bit, 
but it might be the easiest way. 
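
If we go that route, a minimal sketch of the per-test setup might look like this (the class 
and app names here are hypothetical, not the actual suite):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterEach, FunSuite}

// Each test gets a fresh SparkContext, trading some speed for isolation from shared state.
class IsolatedMetricsSuite extends FunSuite with BeforeAndAfterEach {
  private var sc: SparkContext = _

  override def beforeEach(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local").setAppName("isolated-test"))
  }

  override def afterEach(): Unit = {
    if (sc != null) sc.stop()
    sc = null
  }

  test("fresh context per test") {
    assert(sc.parallelize(1 to 10).count() === 10)
  }
}
{code}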

 Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads 
 and input metrics with mixed read method 
 --

 Key: SPARK-5679
 URL: https://issues.apache.org/jira/browse/SPARK-5679
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Kostas Sakellis
Priority: Blocker

 Please audit these and see if there are any assumptions with respect to File 
 IO that might not hold in all cases. I'm happy to help if you can't find 
 anything.
 These both failed in the same run:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-SBT/38/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink
 {code}
 org.apache.spark.metrics.InputOutputMetricsSuite.input metrics with mixed 
 read method
 Failing for the past 13 builds (Since Failed#26 )
 Took 48 sec.
 Error Message
 2030 did not equal 6496
 Stacktrace
 sbt.ForkMain$ForkError: 2030 did not equal 6496
   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
   at 
 org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply$mcV$sp(InputOutputMetricsSuite.scala:135)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfter$$super$runTest(InputOutputMetricsSuite.scala:46)
   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite.runTest(InputOutputMetricsSuite.scala:46)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 

[jira] [Updated] (SPARK-5690) Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion

2015-02-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5690:
-
Affects Version/s: 1.3.0

 Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple 
 submit until completion
 -

 Key: SPARK-5690
 URL: https://issues.apache.org/jira/browse/SPARK-5690
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Andrew Or
  Labels: flaky-test

 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/1647/testReport/junit/org.apache.spark.deploy.rest/StandaloneRestSubmitSuite/simple_submit_until_completion/
 {code}
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until 
 completion
 Failing for the past 1 build (Since Failed#1647 )
 Took 30 sec.
 Error Message
 Driver driver-20150209035158- did not finish within 30 seconds.
 Stacktrace
 sbt.ForkMain$ForkError: Driver driver-20150209035158- did not finish 
 within 30 seconds.
   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$apache$spark$deploy$rest$StandaloneRestSubmitSuite$$waitUntilFinished(StandaloneRestSubmitSuite.scala:152)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply$mcV$sp(StandaloneRestSubmitSuite.scala:57)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StandaloneRestSubmitSuite.scala:41)
   at 
 org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.runTest(StandaloneRestSubmitSuite.scala:41)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterAll$$super$run(StandaloneRestSubmitSuite.scala:41)
   at 
 org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
   

[jira] [Updated] (SPARK-5690) Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion

2015-02-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5690:
-
Priority: Critical  (was: Major)

 Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple 
 submit until completion
 -

 Key: SPARK-5690
 URL: https://issues.apache.org/jira/browse/SPARK-5690
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Andrew Or
Priority: Critical
  Labels: flaky-test

 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/1647/testReport/junit/org.apache.spark.deploy.rest/StandaloneRestSubmitSuite/simple_submit_until_completion/
 {code}
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until 
 completion
 Failing for the past 1 build (Since Failed#1647 )
 Took 30 sec.
 Error Message
 Driver driver-20150209035158- did not finish within 30 seconds.
 Stacktrace
 sbt.ForkMain$ForkError: Driver driver-20150209035158- did not finish 
 within 30 seconds.
   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$apache$spark$deploy$rest$StandaloneRestSubmitSuite$$waitUntilFinished(StandaloneRestSubmitSuite.scala:152)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply$mcV$sp(StandaloneRestSubmitSuite.scala:57)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StandaloneRestSubmitSuite.scala:41)
   at 
 org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.runTest(StandaloneRestSubmitSuite.scala:41)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterAll$$super$run(StandaloneRestSubmitSuite.scala:41)
   at 
 

[jira] [Created] (SPARK-5690) Flaky test:

2015-02-09 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-5690:
--

 Summary: Flaky test: 
 Key: SPARK-5690
 URL: https://issues.apache.org/jira/browse/SPARK-5690
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Patrick Wendell
Assignee: Andrew Or


https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/1647/testReport/junit/org.apache.spark.deploy.rest/StandaloneRestSubmitSuite/simple_submit_until_completion/

{code}
org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until 
completion

Failing for the past 1 build (Since Failed#1647 )
Took 30 sec.
Error Message

Driver driver-20150209035158- did not finish within 30 seconds.
Stacktrace

sbt.ForkMain$ForkError: Driver driver-20150209035158- did not finish within 
30 seconds.
at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
at 
org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$apache$spark$deploy$rest$StandaloneRestSubmitSuite$$waitUntilFinished(StandaloneRestSubmitSuite.scala:152)
at 
org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply$mcV$sp(StandaloneRestSubmitSuite.scala:57)
at 
org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52)
at 
org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at 
org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StandaloneRestSubmitSuite.scala:41)
at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
at 
org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.runTest(StandaloneRestSubmitSuite.scala:41)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
at 
org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterAll$$super$run(StandaloneRestSubmitSuite.scala:41)
at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
at 
org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.run(StandaloneRestSubmitSuite.scala:41)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
at 

[jira] [Updated] (SPARK-5690) Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion

2015-02-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5690:
---
Summary: Flaky test: 
org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until 
completion  (was: Flaky test: )

 Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple 
 submit until completion
 -

 Key: SPARK-5690
 URL: https://issues.apache.org/jira/browse/SPARK-5690
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Patrick Wendell
Assignee: Andrew Or
  Labels: flaky-test

 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/1647/testReport/junit/org.apache.spark.deploy.rest/StandaloneRestSubmitSuite/simple_submit_until_completion/
 {code}
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until 
 completion
 Failing for the past 1 build (Since Failed#1647 )
 Took 30 sec.
 Error Message
 Driver driver-20150209035158- did not finish within 30 seconds.
 Stacktrace
 sbt.ForkMain$ForkError: Driver driver-20150209035158- did not finish 
 within 30 seconds.
   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$apache$spark$deploy$rest$StandaloneRestSubmitSuite$$waitUntilFinished(StandaloneRestSubmitSuite.scala:152)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply$mcV$sp(StandaloneRestSubmitSuite.scala:57)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StandaloneRestSubmitSuite.scala:41)
   at 
 org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.runTest(StandaloneRestSubmitSuite.scala:41)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterAll$$super$run(StandaloneRestSubmitSuite.scala:41)
   at 
 

[jira] [Updated] (SPARK-5689) Document what can be run in different YARN modes

2015-02-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5689:
---
Issue Type: Documentation  (was: Improvement)

 Document what can be run in different YARN modes
 

 Key: SPARK-5689
 URL: https://issues.apache.org/jira/browse/SPARK-5689
 Project: Spark
  Issue Type: Documentation
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves

 We should document what can be run in the different YARN modes. For 
 instance, the interactive shell only works in yarn-client mode; recently, with 
 https://github.com/apache/spark/pull/3976, users can run Python scripts in 
 cluster mode, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1142) Allow adding jars on app submission, outside of code

2015-02-09 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312608#comment-14312608
 ] 

Patrick Wendell commented on SPARK-1142:


This already exists - you can use the --jars flag to spark-submit or set 
'spark.jars' manually.
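
For reference, the configuration-based variant looks like this (the paths and app name are 
placeholders):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Equivalent to passing --jars on the command line: a comma-separated list of jars
// to ship with the application.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.jars", "/path/to/dep1.jar,/path/to/dep2.jar")
val sc = new SparkContext(conf)
{code}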

 Allow adding jars on app submission, outside of code
 

 Key: SPARK-1142
 URL: https://issues.apache.org/jira/browse/SPARK-1142
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 0.9.0
Reporter: Sandy Pérez González
Assignee: Sandy Ryza

 yarn-standalone mode supports an option that allows adding jars that will be 
 distributed on the cluster with job submission.  Providing similar 
 functionality for other app submission modes will allow the spark-app script 
 proposed in SPARK-1126 to support an add-jars option that works for every 
 submit mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5685) Show warning when users open text files compressed with non-splittable algorithms like gzip

2015-02-09 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312622#comment-14312622
 ] 

Josh Rosen commented on SPARK-5685:
---

[~nchammas], in general I'm a huge fan of runtime warnings / better exceptions, 
especially for common issues like this.  I wonder if it would be too noisy to 
log a warning every time textFile was used on compressed input; instead, what 
do you think about logging only in cases where minPartitions > 1 and the input 
is compressed?  This would cover the case where a user sees that they're not 
obtaining sufficient parallelism and then tries to increase the parallelism.

Also, what happens if the user specifies the path of a directory that contains 
many input files, some of which are compressed and some which aren't?  Does the 
driver know whether the files are compressed in an unsplittable way, or do we 
only discover this on the executors once the job runs?
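
As a sketch of what such a check could look like (hypothetical, not actual Spark code; it 
assumes Hadoop's codec factory can resolve the codec from the file extension on the driver):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

// Warn only when the user asked for parallelism but the codec cannot be split.
def warnIfNotSplittable(pathStr: String, minPartitions: Int, hadoopConf: Configuration): Unit = {
  val codec = new CompressionCodecFactory(hadoopConf).getCodec(new Path(pathStr))
  val splittable = codec == null || codec.isInstanceOf[SplittableCompressionCodec]
  if (minPartitions > 1 && !splittable) {
    println(s"WARN: $pathStr uses non-splittable codec ${codec.getClass.getSimpleName}; " +
      "it will be read by a single task regardless of minPartitions.")
  }
}
{code}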

 Show warning when users open text files compressed with non-splittable 
 algorithms like gzip
 ---

 Key: SPARK-5685
 URL: https://issues.apache.org/jira/browse/SPARK-5685
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Nicholas Chammas
Priority: Minor

 This is a usability or user-friendliness issue.
 It's extremely common for people to load a text file compressed with gzip, 
 process it, and then wonder why only 1 core in their cluster is doing any 
 work.
 Some examples:
 * http://stackoverflow.com/q/28127119/877069
 * http://stackoverflow.com/q/27531816/877069
 I'm not sure how this problem can be generalized, but at the very least it 
 would be helpful if Spark displayed some kind of warning in the common case 
 when someone opens a gzipped file with {{sc.textFile}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5343) ShortestPaths traverses backwards

2015-02-09 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312591#comment-14312591
 ] 

Brennon York commented on SPARK-5343:
-

I'll take this issue, thanks.

 ShortestPaths traverses backwards
 -

 Key: SPARK-5343
 URL: https://issues.apache.org/jira/browse/SPARK-5343
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.2.0
Reporter: Michael Malak

 GraphX ShortestPaths seems to be following edges backwards instead of 
 forwards:
 import org.apache.spark.graphx._
 val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), 
 sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,""))))
 lib.ShortestPaths.run(g,Array(3)).vertices.collect
 res1: Array[(org.apache.spark.graphx.VertexId, 
 org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3 
 -> 0)), (2,Map()))
 lib.ShortestPaths.run(g,Array(1)).vertices.collect
 res2: Array[(org.apache.spark.graphx.VertexId, 
 org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), 
 (3,Map(1 -> 2)), (2,Map(1 -> 1)))
 The following changes may be what will make it run forward:
 Change one occurrence of src to dst in
 https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L64
 Change three occurrences of dst to src in
 https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L65



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3355) Allow running maven tests in run-tests

2015-02-09 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312588#comment-14312588
 ] 

Brennon York commented on SPARK-3355:
-

I've started this and should have the fix up shortly. It currently leverages 
the AMPLAB_JENKINS_BUILD_TOOL environment variable, though, and I wanted to sync 
up to see whether that still makes sense, since this is now duplicated.

 Allow running maven tests in run-tests
 --

 Key: SPARK-3355
 URL: https://issues.apache.org/jira/browse/SPARK-3355
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell

 We should have a variable called AMPLAB_JENKINS_BUILD_TOOL that decides 
 whether to run sbt or maven.
 This would allow us to simplify our build matrix in Jenkins... currently the 
 maven builds run a totally different thing than the normal run-tests builds.
 The maven build currently does something like this:
 {code}
 mvn -DskipTests -Pprofile1 -Pprofile2 ... clean package
 mvn test -Pprofile1 -Pprofile2 ... --fail-at-end
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5691) Preventing duplicate registering of an application has incorrect logic

2015-02-09 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312632#comment-14312632
 ] 

Matt Cheah commented on SPARK-5691:
---

I've determined that this is a pretty simple bug in the Master code.

I'm on commit hash 0793ee1b4dea1f4b0df749e8ad7c1ab70b512faf. In Master.scala, 
in the registerApplication method, it checks if the application is already 
registered by checking the addressToWorker data structure. In reality, this is 
wrong - it should examine the addressToApp data structure.

I'll submit a pull request shortly.
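
For clarity, a minimal self-contained sketch of the logic error (stand-in types and names, 
not the actual Master code):

{code}
object RegistrationCheckSketch {
  import scala.collection.mutable

  case class RpcAddress(host: String, port: Int)

  val addressToWorker = mutable.HashMap.empty[RpcAddress, String]  // worker registry (stand-in)
  val addressToApp    = mutable.HashMap.empty[RpcAddress, String]  // app registry (stand-in)

  // The duplicate check must consult the map that tracks applications, not workers.
  // The buggy version checked addressToWorker, which is almost never true for a driver
  // address, so a re-sent registration message was accepted and the app registered twice.
  def isDuplicateAppRegistration(appAddress: RpcAddress): Boolean =
    addressToApp.contains(appAddress)
}
{code}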

 Preventing duplicate registering of an application has incorrect logic
 --

 Key: SPARK-5691
 URL: https://issues.apache.org/jira/browse/SPARK-5691
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.1, 1.2.0
Reporter: Matt Cheah
 Fix For: 1.3.0


 If an application registers twice with the Master, the Master accepts both 
 copies and they both show up in the UI and consume resources. This is 
 incorrect behavior.
 This happens inadvertently in regular usage when the Master is under high 
 load, but it boils down to this: when an application times out while 
 registering and sends a second registration message even though the Master is 
 still alive, the Master processes the first registration message for the 
 application but also erroneously processes the second one instead of 
 discarding it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4600) org.apache.spark.graphx.VertexRDD.diff does not work

2015-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-4600:


Assignee: Brennon York

 org.apache.spark.graphx.VertexRDD.diff does not work
 

 Key: SPARK-4600
 URL: https://issues.apache.org/jira/browse/SPARK-4600
 Project: Spark
  Issue Type: Bug
  Components: GraphX
 Environment: scala 2.10.4
 spark 1.1.0
Reporter: Teppei Tosa
Assignee: Brennon York
  Labels: graphx

 VertexRDD.diff doesn't work.
 For example : 
 val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 2L).map(id => 
 (id, id.toInt)))
 setA.collect.foreach(println(_))
 // (0,0)
 // (1,1)
 val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(1L until 3L).map(id => 
 (id, id.toInt)))
 setB.collect.foreach(println(_))
 // (1,1)
 // (2,2)
 val diff = setA.diff(setB)
 diff.collect.foreach(println(_))
 // printed none



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5691) Preventing duplicate registering of an application has incorrect logic

2015-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312644#comment-14312644
 ] 

Apache Spark commented on SPARK-5691:
-

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/4477

 Preventing duplicate registering of an application has incorrect logic
 --

 Key: SPARK-5691
 URL: https://issues.apache.org/jira/browse/SPARK-5691
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.1, 1.2.0
Reporter: Matt Cheah
 Fix For: 1.3.0


 If an application registers twice with the Master, the Master accepts both 
 copies and they both show up in the UI and consume resources. This is 
 incorrect behavior.
 This happens inadvertently in regular usage when the Master is under high 
 load, but it boils down to this: when an application times out while 
 registering and sends a second registration message even though the Master is 
 still alive, the Master processes the first registration message for the 
 application but also erroneously processes the second one instead of 
 discarding it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo

2015-02-09 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313254#comment-14313254
 ] 

Nicholas Chammas commented on SPARK-5676:
-

It ended up in Mesos because [Spark itself came out of 
Mesos|https://github.com/mesos/spark]. It's just history. The Mesos project 
does not manage that repo at all.

Anyway, we should add a license for clarity's sake since we accept contributions 
there, regardless of whether those scripts are being distributed or not. I 
think we agree on that. If you don't want an issue to track that here, that's 
fine. No big deal either way, honestly.

 License missing from spark-ec2 repo
 ---

 Key: SPARK-5676
 URL: https://issues.apache.org/jira/browse/SPARK-5676
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Florian Verhein

 There is no LICENSE file or licence headers in the code in the spark-ec2 
 repo. Also, I believe there is no contributor license agreement notification 
 in place (like there is in the main spark repo).
 It would be great to fix this (sooner better than later while contributors 
 list is small), so that users wishing to use this part of Spark are not in 
 doubt over licensing issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo

2015-02-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313260#comment-14313260
 ] 

Sean Owen commented on SPARK-5676:
--

Ah you're saying it isn't even part of Mesos. I see why there's not obviously a 
good home in JIRA then. Can that repo just be managed as its own self-contained 
project then, using github issues?

 License missing from spark-ec2 repo
 ---

 Key: SPARK-5676
 URL: https://issues.apache.org/jira/browse/SPARK-5676
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Florian Verhein

 There is no LICENSE file or license headers in the code in the spark-ec2 
 repo. Also, I believe there is no contributor license agreement notification 
 in place (like there is in the main spark repo).
 It would be great to fix this (sooner rather than later, while the contributor 
 list is small), so that users wishing to use this part of Spark are not left in 
 doubt over licensing issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo

2015-02-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313269#comment-14313269
 ] 

Shivaram Venkataraman commented on SPARK-5676:
--

Yes - it is managed as a self-contained project. However, bugs in that project 
are often hit by Spark users, so we end up with issues filed here. I think 
filing these issues under the EC2 component is fine, since they do affect Spark 
usage on EC2.

 License missing from spark-ec2 repo
 ---

 Key: SPARK-5676
 URL: https://issues.apache.org/jira/browse/SPARK-5676
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Florian Verhein

 There is no LICENSE file or license headers in the code in the spark-ec2 
 repo. Also, I believe there is no contributor license agreement notification 
 in place (like there is in the main spark repo).
 It would be great to fix this (sooner rather than later, while the contributor 
 list is small), so that users wishing to use this part of Spark are not left in 
 doubt over licensing issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5648) support alter ... unset tblproperties(key)

2015-02-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5648.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4424
[https://github.com/apache/spark/pull/4424]

 support alter ... unset tblproperties(key) 
 ---

 Key: SPARK-5648
 URL: https://issues.apache.org/jira/browse/SPARK-5648
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: DoingDone9
 Fix For: 1.3.0


 Make HiveContext support unset tblproperties(key), for example:
 alter view viewName unset tblproperties(k)
 alter table tableName unset tblproperties(k)
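
 (As a usage illustration only, not code taken from the pull request: the 
 feature amounts to being able to run statements like the following through 
 {{HiveContext.sql}}. The table and property names below are hypothetical, 
 and the sketch assumes a Spark 1.3-era {{HiveContext}} built with Hive 
 support.)

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object UnsetTblPropertiesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("unset-tblproperties"))
    val hiveContext = new HiveContext(sc)

    // Hypothetical table and property names, for illustration only.
    hiveContext.sql("CREATE TABLE IF NOT EXISTS demo_table (id INT)")
    hiveContext.sql("ALTER TABLE demo_table SET TBLPROPERTIES ('owner' = 'etl-team')")

    // The behavior this issue adds: removing a table property again.
    hiveContext.sql("ALTER TABLE demo_table UNSET TBLPROPERTIES ('owner')")

    sc.stop()
  }
}
{code}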



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo

2015-02-09 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313271#comment-14313271
 ] 

Nicholas Chammas commented on SPARK-5676:
-

Yeah, AFAIK it has nothing to do with Mesos apart from that historical 
connection.

I think having it stand alone using GitHub issues is fine, though the question 
remains of what account it should fall under. Maybe spark-ec2/spark-ec2 (kinda 
like [boot2docker/boot2docker|https://github.com/boot2docker/boot2docker])? 
Anyway, this is a separate question from the one raised here. But yeah, it 
should probably be moved at some point.

 License missing from spark-ec2 repo
 ---

 Key: SPARK-5676
 URL: https://issues.apache.org/jira/browse/SPARK-5676
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Florian Verhein

 There is no LICENSE file or license headers in the code in the spark-ec2 
 repo. Also, I believe there is no contributor license agreement notification 
 in place (like there is in the main spark repo).
 It would be great to fix this (sooner rather than later, while the contributor 
 list is small), so that users wishing to use this part of Spark are not left in 
 doubt over licensing issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


