[jira] [Commented] (SPARK-3169) make-distribution.sh failed
[ https://issues.apache.org/jira/browse/SPARK-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105277#comment-14105277 ] Sean Owen commented on SPARK-3169: -- Same as https://issues.apache.org/jira/browse/SPARK-2798 ? it's resolving similar problems in the Flume build. make-distribution.sh failed --- Key: SPARK-3169 URL: https://issues.apache.org/jira/browse/SPARK-3169 Project: Spark Issue Type: Bug Components: Build Reporter: Guoqiang Li Priority: Blocker {code}./make-distribution.sh -Pyarn -Phadoop-2.3 -Phive-thriftserver -Phive -Dhadoop.version=2.3.0 {code} = {noformat} java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289) at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229) at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415) at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356) Caused by: scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in TestSuiteBase.class refers to term dstream in package org.apache.spark.streaming which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling TestSuiteBase.class. {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1449) Please delete old releases from mirroring system
[ https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105303#comment-14105303 ] Sean Owen commented on SPARK-1449: -- [~pwendell] can you or someone else on the PMC zap this one? should be straightforward. Please delete old releases from mirroring system Key: SPARK-1449 URL: https://issues.apache.org/jira/browse/SPARK-1449 Project: Spark Issue Type: Task Affects Versions: 0.8.0, 0.8.1, 0.9.0, 0.9.1 Reporter: Sebb To reduce the load on the ASF mirrors, projects are required to delete old releases [1] Please can you remove all non-current releases? Thanks! [Note that older releases are always available from the ASF archive server] Any links to older releases on download pages should first be adjusted to point to the archive server. [1] http://www.apache.org/dev/release.html#when-to-archive -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3202) Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD
[ https://issues.apache.org/jira/browse/SPARK-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109128#comment-14109128 ] Sean Owen commented on SPARK-3202: -- JIRA is not a good place to ask questions -- please use u...@spark.apache.org. This is for reporting issues, so I'd recommend closing this. Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD - Key: SPARK-3202 URL: https://issues.apache.org/jira/browse/SPARK-3202 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Hingorani, Vineet Hello all, Could someone help me with the manipulation of CSV file data. I have 'semicolon'-separated CSV data including doubles and strings. I want to calculate the maximum/average of a column. When I read the file using sc.textFile("test.csv").map(_.split(";")), each field is read as a string. Could someone help me with the above manipulation and how to do that. Or maybe if there is some way to take the transpose of the data and then manipulate the rows in some way? Thank you in advance, I have been struggling with this for quite some time. Regards, Vineet -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
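For reference, the kind of column computation being asked about can be sketched roughly as follows (a minimal sketch, assuming a semicolon-separated file named test.csv and, hypothetically, that the numeric column of interest is the third one; the column index is made up for illustration):
{code}
val rows = sc.textFile("test.csv").map(_.split(";"))
// Parse the (hypothetical) third column as Double
val values = rows.map(fields => fields(2).toDouble)
val maxValue = values.max()   // maximum of the column
val avgValue = values.mean()  // average of the column, via DoubleRDDFunctions
{code}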
[jira] [Commented] (SPARK-2798) Correct several small errors in Flume module pom.xml files
[ https://issues.apache.org/jira/browse/SPARK-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109675#comment-14109675 ] Sean Owen commented on SPARK-2798: -- [~tdas] Cool, I think this closes SPARK-3169 too, if I understand correctly. Correct several small errors in Flume module pom.xml files -- Key: SPARK-2798 URL: https://issues.apache.org/jira/browse/SPARK-2798 Project: Spark Issue Type: Bug Components: Build Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.1.0 (EDIT) Since the scalatest issue was resolved, this is now about a few small problems in the Flume Sink pom.xml - scalatest is not declared as a test-scope dependency - Its Avro version doesn't match the rest of the build - Its Flume version is not synced with the other Flume module - The other Flume module declares its dependency on Flume Sink slightly incorrectly, hard-coding the Scala 2.10 version - It depends on Scala Lang directly, which it shouldn't -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110463#comment-14110463 ] Sean Owen commented on SPARK-3098: -- [~matei] The question isn't whether distinct returns a particular ordering, or whether zipWithIndex assigns particular indices, but whether they would result in the same ordering and same assignments every time the RDD is evaluated: {code} val c = {...}.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2) {code} If so, then the same values should map to the same indices, and the self-join of c to itself should always pair the same value with itself. Regardless of what those un-guaranteed values are, they should be the same since it's the very same RDD. If not, obviously that explains the behavior then. The behavior at first glance had also surprised me, since I had taken RDDs to be deterministic and transparently recomputable on demand. That is the important first question -- is that supposed to be so or not? In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical The code to reproduce it: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
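As a side note, one way to take recomputation out of the picture while investigating is to persist the RDD before reusing it (a sketch only, not a fix for the underlying semantics; caching is best-effort and evicted partitions may still be recomputed):
{code}
val c = sc.parallelize(1 to 7899).flatMap { i =>
  (1 to 1).toSeq.map(p => i * 6000 + p)
}.distinct().zipWithIndex().cache()
c.count()  // force evaluation once so both sides of the join reuse the same materialized partitions
c.join(c).filter(t => t._2._1 != t._2._2).take(3)  // expected to be empty if c is not recomputed
{code}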
[jira] [Commented] (SPARK-3276) Provide a API to specify whether the old files need to be ignored in file input text DStream
[ https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113704#comment-14113704 ] Sean Owen commented on SPARK-3276: -- Given the nature of a stream processing framework, when would you want to keep reprocessing all old data? That is something you can do, but it doesn't require Spark Streaming. Provide a API to specify whether the old files need to be ignored in file input text DStream Key: SPARK-3276 URL: https://issues.apache.org/jira/browse/SPARK-3276 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.2 Reporter: Jack Hu Currently, there is only one API, textFileStream in StreamingContext, to create a text file DStream, and it always ignores old files. Sometimes, the old files are still useful. We need an API to let the user choose whether the old files should be ignored or not. The API currently in StreamingContext: def textFileStream(directory: String): DStream[String] = { fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString) } -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3274) Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream
[ https://issues.apache.org/jira/browse/SPARK-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113702#comment-14113702 ] Sean Owen commented on SPARK-3274: -- Same as the problem and solution in https://issues.apache.org/jira/browse/SPARK-1040 Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream -- Key: SPARK-3274 URL: https://issues.apache.org/jira/browse/SPARK-3274 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2 Reporter: Jack Hu Reproduce code: scontext.socketTextStream("localhost", 1) .mapToPair(new PairFunction<String, String, String>(){ public Tuple2<String, String> call(String arg0) throws Exception { return new Tuple2<String, String>("1", arg0); } }) .foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() { public Void call(JavaPairRDD<String, String> v1, Time v2) throws Exception { System.out.println(v2.toString() + ": " + v1.collectAsMap().toString()); return null; } }); Exception: java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lscala.Tuple2; at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:447) at org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:464) at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:90) at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:88) at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282) at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobS -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113710#comment-14113710 ] Sean Owen commented on SPARK-3266: -- The method is declared in the superclass, JavaRDDLike: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala#L538 You are running a different version of Spark than you are compiling with, and the runtime version is perhaps too old to contain this method. This is not a Spark issue. JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1 Reporter: Amey Chaugule While I can compile my code, I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I don't notice max() although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114391#comment-14114391 ] Sean Owen commented on SPARK-3266: -- (Mea culpa! The example shows this is a legitimate question. I'll be quiet now.) JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1, 1.0.2 Reporter: Amey Chaugule Assignee: Josh Rosen Attachments: spark-repro-3266.tar.gz While I can compile my code, I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I don't notice max() although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3292) Shuffle Tasks run indefinitely even though there's no inputs
[ https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114829#comment-14114829 ] Sean Owen commented on SPARK-3292: -- Can you elaborate on this? It's not clear whether you're reporting that the process hangs, runs slowly, or creates too many files. Shuffle Tasks run indefinitely even though there's no inputs Key: SPARK-3292 URL: https://issues.apache.org/jira/browse/SPARK-3292 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2 Reporter: guowei Operations such as repartition, groupBy, join and cogroup are too expensive; for example, if I want to save the outputs as Hadoop files, then many empty files are generated. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3302) The wrong version information in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14115126#comment-14115126 ] Sean Owen commented on SPARK-3302: -- This duplicates at least one of https://issues.apache.org/jira/browse/SPARK-2697 or https://issues.apache.org/jira/browse/SPARK-3273 The wrong version information in SparkContext - Key: SPARK-3302 URL: https://issues.apache.org/jira/browse/SPARK-3302 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Guoqiang Li Assignee: Guoqiang Li Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3324) YARN module has nonstandard structure which cause compile error In IntelliJ
[ https://issues.apache.org/jira/browse/SPARK-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116278#comment-14116278 ] Sean Owen commented on SPARK-3324: -- I agree, I've had a similar problem and just resolved it manually. I imagine the answer is, soon we're going to delete alpha anyway and then this is moot. YARN module has nonstandard structure which cause compile error In IntelliJ --- Key: SPARK-3324 URL: https://issues.apache.org/jira/browse/SPARK-3324 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Environment: Mac OS: 10.9.4 IntelliJ IDEA: 13.1.4 Scala Plugins: 0.41.2 Maven: 3.0.5 Reporter: Yi Tian Priority: Minor Labels: intellij, maven, yarn The YARN module has a nonstandard path structure like: {code} ${SPARK_HOME} |--yarn |--alpha (contains yarn api support for 0.23 and 2.0.x) |--stable (contains yarn api support for 2.2 and later) | |--pom.xml (spark-yarn) |--common (Common codes not depending on specific version of Hadoop) |--pom.xml (yarn-parent) {code} When we use Maven to compile the yarn module, Maven will import the 'alpha' or 'stable' module according to the profile setting. And a submodule like 'stable' uses the build properties defined in yarn/pom.xml to import the common code into its sourcePath. As a result, IntelliJ can't directly recognize the code in the common directory as a sourcePath. I think we should change the yarn module to a unified Maven jar project, and choose the version of the YARN API via a Maven profile setting in the pom.xml. That would resolve the compile error in IntelliJ and make the yarn module simpler and clearer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3326) can't access a static variable after init in mapper
[ https://issues.apache.org/jira/browse/SPARK-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116327#comment-14116327 ] Sean Owen commented on SPARK-3326: -- The call to Foo.getSome() occurs remotely, on a different JVM with a different copy of your class. You may initialize your instance in the driver, but this leaves it uninitialized in the remote workers. You can initialize this in a static block. Or you can simply reference the value of Foo.getSome() directly in your map function and then it is serialized in the closure. All that you send right now is a function that depends on what Foo.getSome() returns when it's called, not what it happens to return on the driver. Consider broadcast variables if it's large. If that's what's going on then this is normal behavior. can't access a static variable after init in mapper --- Key: SPARK-3326 URL: https://issues.apache.org/jira/browse/SPARK-3326 Project: Spark Issue Type: Bug Environment: CDH5.1.0 Spark1.0.0 Reporter: Gavin Zhang I wrote an object like: object Foo { private var bar: Bar = null; def init(bar: Bar) { this.bar = bar }; def getSome() = bar.someDef() } In the Spark main def, I read some text from HDFS and init this object, and then call getSome(). I was successful with this code: sc.textFile(args(0)).take(10).map(println(Foo.getSome())) However, when I changed it to write output to HDFS, I found the bar variable in the Foo object is null: sc.textFile(args(0)).map(line => Foo.getSome()).saveAsTextFile(args(1)) WHY? -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
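To make the second suggestion concrete, here is a sketch of capturing the value on the driver instead of calling the uninitialized singleton on the workers (an illustration only, assuming getSome() returns a serializable value):
{code}
// On the driver, after Foo.init(bar) has run:
val some = Foo.getSome()   // evaluated locally, while Foo is initialized
sc.textFile(args(0)).map(line => some).saveAsTextFile(args(1))

// Or, if the value is large, ship it once per executor as a broadcast variable:
val someBc = sc.broadcast(Foo.getSome())
sc.textFile(args(0)).map(line => someBc.value).saveAsTextFile(args(1))
{code}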
[jira] [Commented] (SPARK-3324) YARN module has nonstandard structure which cause compile error In IntelliJ
[ https://issues.apache.org/jira/browse/SPARK-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116667#comment-14116667 ] Sean Owen commented on SPARK-3324: -- Let me try to sketch what's funky about the structure. We have yarn/alpha, yarn/common, yarn/stable. Understanding the purpose, I would expect each to be a module, and that each has a src/ directory, and that alpha and stable depend on common, and the Spark parent activates either yarn/alpha or yarn/stable depending on profiles. IntelliJ is fine with that. However, what we have is that yarn/ is a module. But its source is in yarn/common. But it's a pom-only module. And yarn/alpha and yarn/stable list it as the parent and inherit all of their source directory info and dependencies from yarn/, which is not itself a module of code. So each compiles two source directories defined in different places. This plus profiles confused IntelliJ and required manual intervention. Maybe I'm overlooking a reason this had to be done, but rejiggering this as three simple modules should work. Again I imagine the question is, is it worth it versus removing yarn/alpha at some point in the future? Because it's trivial to fix how IntelliJ reads the POMs once by hand in the IDE. YARN module has nonstandard structure which cause compile error In IntelliJ --- Key: SPARK-3324 URL: https://issues.apache.org/jira/browse/SPARK-3324 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Environment: Mac OS: 10.9.4 IntelliJ IDEA: 13.1.4 Scala Plugins: 0.41.2 Maven: 3.0.5 Reporter: Yi Tian Priority: Minor Labels: intellij, maven, yarn The YARN module has a nonstandard path structure like: {code} ${SPARK_HOME} |--yarn |--alpha (contains yarn api support for 0.23 and 2.0.x) |--stable (contains yarn api support for 2.2 and later) | |--pom.xml (spark-yarn) |--common (Common codes not depending on specific version of Hadoop) |--pom.xml (yarn-parent) {code} When we use Maven to compile the yarn module, Maven will import the 'alpha' or 'stable' module according to the profile setting. And a submodule like 'stable' uses the build properties defined in yarn/pom.xml to import the common code into its sourcePath. As a result, IntelliJ can't directly recognize the sources in the common directory as a sourcePath. I think we should change the yarn module to a unified Maven jar project, and specify the version of the YARN API via a Maven profile setting. That would resolve the compile error in IntelliJ and make the yarn module simpler and clearer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3324) YARN module has nonstandard structure which cause compile error In IntelliJ
[ https://issues.apache.org/jira/browse/SPARK-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116801#comment-14116801 ] Sean Owen commented on SPARK-3324: -- [~tianyi] I seem to remember having a similar problem. I think that is straightforward to fix. It's a separate issue. But FWIW I would like to see that improved. YARN module has nonstandard structure which cause compile error In IntelliJ --- Key: SPARK-3324 URL: https://issues.apache.org/jira/browse/SPARK-3324 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Environment: Mac OS: 10.9.4 IntelliJ IDEA: 13.1.4 Scala Plugins: 0.41.2 Maven: 3.0.5 Reporter: Yi Tian Priority: Minor Labels: intellij, maven, yarn The YARN module has a nonstandard path structure like: {code} ${SPARK_HOME} |--yarn |--alpha (contains yarn api support for 0.23 and 2.0.x) |--stable (contains yarn api support for 2.2 and later) | |--pom.xml (spark-yarn) |--common (Common codes not depending on specific version of Hadoop) |--pom.xml (yarn-parent) {code} When we use Maven to compile the yarn module, Maven will import the 'alpha' or 'stable' module according to the profile setting. And a submodule like 'stable' uses the build properties defined in yarn/pom.xml to import the common code into its sourcePath. As a result, IntelliJ can't directly recognize the sources in the common directory as a sourcePath. I think we should change the yarn module to a unified Maven jar project, and specify the version of the YARN API via a Maven profile setting. That would resolve the compile error in IntelliJ and make the yarn module simpler and clearer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3330) Successive test runs with different profiles fail SparkSubmitSuite
Sean Owen created SPARK-3330: Summary: Successive test runs with different profiles fail SparkSubmitSuite Key: SPARK-3330 URL: https://issues.apache.org/jira/browse/SPARK-3330 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Maven-based Jenkins builds have been failing for a while: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/480/HADOOP_PROFILE=hadoop-2.4,label=centos/console One common cause is that on the second and subsequent runs of mvn clean test, at least two assembly JARs will exist in assembly/target. Because assembly is not a submodule of parent, mvn clean is not invoked for assembly. The presence of two assembly jars causes spark-submit to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3331) PEP8 tests fail in release because they check unzipped py4j code
Sean Owen created SPARK-3331: Summary: PEP8 tests fail in release because they check unzipped py4j code Key: SPARK-3331 URL: https://issues.apache.org/jira/browse/SPARK-3331 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Priority: Minor PEP8 tests run on files under ./python, but in the release packaging, py4j code is present in ./python/build/py4j. Py4J code fails style checks and thus release fails ./dev/run-tests now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3331) PEP8 tests fail because they check unzipped py4j code
[ https://issues.apache.org/jira/browse/SPARK-3331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3331: - Summary: PEP8 tests fail because they check unzipped py4j code (was: PEP8 tests fail in release because they check unzipped py4j code) PEP8 tests fail because they check unzipped py4j code - Key: SPARK-3331 URL: https://issues.apache.org/jira/browse/SPARK-3331 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Priority: Minor PEP8 tests run on files under ./python, but in the release packaging, py4j code is present in ./python/build/py4j. Py4J code fails style checks and thus release fails ./dev/run-tests now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3331) PEP8 tests fail because they check unzipped py4j code
[ https://issues.apache.org/jira/browse/SPARK-3331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3331: - Description: PEP8 tests run on files under ./python, but unzipped py4j code is found at ./python/build/py4j. Py4J code fails style checks and can fail ./dev/run-tests if this code is present locally. (was: PEP8 tests run on files under ./python, but in the release packaging, py4j code is present in ./python/build/py4j. Py4J code fails style checks and thus release fails ./dev/run-tests now.) PEP8 tests fail because they check unzipped py4j code - Key: SPARK-3331 URL: https://issues.apache.org/jira/browse/SPARK-3331 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Priority: Minor PEP8 tests run on files under ./python, but unzipped py4j code is found at ./python/build/py4j. Py4J code fails style checks and can fail ./dev/run-tests if this code is present locally. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3330) Successive test runs with different profiles fail SparkSubmitSuite
[ https://issues.apache.org/jira/browse/SPARK-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3330. -- Resolution: Won't Fix It would be more suitable to change Jenkins to run {{mvn clean}} followed by {{mvn ... package}} to address this, if anything. It's not yet clear this is the cause of the failure in the same test in Jenkins though. Successive test runs with different profiles fail SparkSubmitSuite -- Key: SPARK-3330 URL: https://issues.apache.org/jira/browse/SPARK-3330 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Maven-based Jenkins builds have been failing for a while: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/480/HADOOP_PROFILE=hadoop-2.4,label=centos/console One common cause is that on the second and subsequent runs of mvn clean test, at least two assembly JARs will exist in assembly/target. Because assembly is not a submodule of parent, mvn clean is not invoked for assembly. The presence of two assembly jars causes spark-submit to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
Sean Owen created SPARK-3369: Summary: Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2 Reporter: Sean Owen {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency though because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunctions}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of: Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}. Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the desired signature, and deprecate existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-563) Run findBugs and IDEA inspections in the codebase
[ https://issues.apache.org/jira/browse/SPARK-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-563. - Resolution: Won't Fix This appears to be obsolete/stale too. Run findBugs and IDEA inspections in the codebase - Key: SPARK-563 URL: https://issues.apache.org/jira/browse/SPARK-563 Project: Spark Issue Type: Improvement Reporter: Ismael Juma I ran into a few instances of unused local variables and unnecessary usage of the 'return' keyword (the recommended practice is to avoid 'return' if possible) and thought it would be good to run findBugs and IDEA inspections to clean-up the code. I am willing to do this, but first would like to know whether you agree that this is a good idea and whether this is the right time to do it. These changes tend to affect many source files and can cause issues if there is major work ongoing in separate branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans
[ https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120314#comment-14120314 ] Sean Owen commented on SPARK-3384: -- [~rnowling] I think it is fairly important for speed in that section of code though. Using mutable data structures is not a problem if done correctly and for the right reason. Potential thread unsafe Breeze vector addition in KMeans Key: SPARK-3384 URL: https://issues.apache.org/jira/browse/SPARK-3384 Project: Spark Issue Type: Bug Components: MLlib Reporter: RJ Nowling In the KMeans clustering implementation, the Breeze vectors are accumulated using +=. For example, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162 This is potentially a thread-unsafe operation. (This is what I observed in local testing.) I suggest changing the += to + -- a new object will be allocated but it will be thread safe since it won't write to an old location accessed by multiple threads. Further testing is required to reproduce and verify. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
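For illustration, one pattern where in-place += on mutable vectors is safe is when the accumulator is created inside the task itself, so it is never shared across threads (a simplified sketch of the general pattern, not the actual KMeans code; the dimension and input data here are made up for the example):
{code}
import breeze.linalg.{DenseVector => BDV}

val dim = 3
val data = sc.parallelize(Seq.fill(1000)(BDV.ones[Double](dim)))
val sum = data.mapPartitions { iter =>
  val acc = BDV.zeros[Double](dim)   // task-local accumulator
  iter.foreach(v => acc += v)        // in-place add, but acc never escapes this task
  Iterator(acc)
}.reduce(_ + _)                      // partial sums are combined without shared mutation
{code}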
[jira] [Resolved] (SPARK-640) Update Hadoop 1 version to 1.1.0 (especially on AMIs)
[ https://issues.apache.org/jira/browse/SPARK-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-640. - Resolution: Fixed This looks stale right? Hadoop 1 version has been at 1.2.1 for some time. Update Hadoop 1 version to 1.1.0 (especially on AMIs) - Key: SPARK-640 URL: https://issues.apache.org/jira/browse/SPARK-640 Project: Spark Issue Type: New Feature Reporter: Matei Zaharia Hadoop 1.1.0 has a fix to the notorious trailing slash for directory objects in S3 issue: https://issues.apache.org/jira/browse/HADOOP-5836, so would be good to support on the AMIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-529) Have a single file that controls the environmental variables and spark config options
[ https://issues.apache.org/jira/browse/SPARK-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-529. - Resolution: Won't Fix This looks obsolete and/or fixed, as variables like SPARK_MEM are deprecated, and I suppose there is spark-env.sh too. Have a single file that controls the environmental variables and spark config options - Key: SPARK-529 URL: https://issues.apache.org/jira/browse/SPARK-529 Project: Spark Issue Type: Improvement Reporter: Reynold Xin E.g. multiple places in the code base uses SPARK_MEM and has its own default set to 512. We need a central place to enforce default values as well as documenting the variables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3404) SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1
Sean Owen created SPARK-3404: Summary: SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1 Key: SPARK-3404 URL: https://issues.apache.org/jira/browse/SPARK-3404 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Maven-based Jenkins builds have been failing for over a month. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ It's SparkSubmitSuite that fails. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull {code} SparkSubmitSuite ... - launch simple application with spark-submit *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... - spark submit includes jars passed in through --jar *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, local-cluster[2,1,512], --jars, file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... {code} SBT builds don't fail, so it is likely to be due to some difference in how the tests are run rather than a problem with test or core project. This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the cause identified in that JIRA is, at least, not the only cause. (Although, it wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins config to invoke {{mvn clean mvn ... package}} {{mvn ... clean package}}.) This JIRA tracks investigation into a different cause. Right now I have some further information but not a PR yet. Part of the issue is that there is no clue in the log about why {{spark-submit}} exited with status 1. 
See https://github.com/apache/spark/pull/2108/files and https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at least print stdout to the log too. The SparkSubmit program exits with 1 when the main class it is supposed to run is not found (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) This is for example SimpleApplicationTest (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) The test actually submits an empty JAR not containing this class. It relies on {{spark-submit}} finding the class within the compiled test-classes of the Spark project. However it does seem to be compiled and present even with Maven. If modified to print stdout and stderr, and dump the actual command, I see an empty stdout, and only the command to stderr: {code} Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home/bin/java -cp
[jira] [Commented] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121653#comment-14121653 ] Sean Owen commented on SPARK-3404: -- It's 100% repeatable in Maven for me locally, which seems to be Jenkins' experience too. I don't see the same problem with SBT (/dev/run-tests) locally, although I can't say I run that regularly. I could rewrite the SparkSubmitSuite to submit a JAR file that actually contains the class it's trying to invoke. Maybe that's smarter? the problem here seems to be the vagaries of what the run-time classpath is during an SBT vs Maven test. Would anyone second that? Separately it would probably not hurt to get in that change that logs stdout / stderr from the Utils method. SparkSubmitSuite fails with spark-submit exits with code 1 Key: SPARK-3404 URL: https://issues.apache.org/jira/browse/SPARK-3404 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2, 1.1.0 Reporter: Sean Owen Priority: Critical Maven-based Jenkins builds have been failing for over a month. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ It's SparkSubmitSuite that fails. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull {code} SparkSubmitSuite ... - launch simple application with spark-submit *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... 
- spark submit includes jars passed in through --jar *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, local-cluster[2,1,512], --jars, file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... {code} SBT builds don't fail, so it is likely to be due to some difference in how the tests are run rather than a problem with test or core project. This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the cause identified in that JIRA is, at least, not the only cause. (Although, it wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins config to invoke {{mvn clean mvn ... package}} {{mvn ... clean package}}.) This JIRA tracks investigation into a different cause. Right now I have some further information but not a PR yet. Part of the issue is that there is no clue in the log about why {{spark-submit}} exited with status 1. See https://github.com/apache/spark/pull/2108/files and https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at least print stdout to the log too. The SparkSubmit program exits with 1 when the main class it is supposed to run is not found
[jira] [Commented] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
[ https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122921#comment-14122921 ] Sean Owen commented on SPARK-3369: -- The API change is unlikely to happen. Making a bunch of flatMap2 methods is really ugly. I suppose you could try wrapping the Iterator in this: {code} public class IteratorIterable<T> implements Iterable<T> { private final Iterator<T> iterator; private boolean consumed; public IteratorIterable(Iterator<T> iterator) { this.iterator = iterator; } @Override public Iterator<T> iterator() { if (consumed) { throw new IllegalStateException("Iterator already consumed"); } consumed = true; return iterator; } } {code} If, as I suspect, Spark actually only calls iterator() once, this will work, and this may be the most tolerable workaround until Spark 2.x. If it doesn't work, and iterator() is called multiple times, this will fail fast and at least we'd know. Can you try something like this? Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator - Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2 Reporter: Sean Owen Attachments: FlatMapIterator.patch {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency though because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunctions}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of: Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}. Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the desired signature, and deprecate existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3442) Create LengthBoundedInputStream
[ https://issues.apache.org/jira/browse/SPARK-3442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14125948#comment-14125948 ] Sean Owen commented on SPARK-3442: -- This exists in Guava as LimitInputStream until Guava 14, and as ByteStreams.limit() from 14 onwards: https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/io/LimitInputStream.java?spec=svn08d3526fc19293cf099e0c50dbf3bbc915f2e3f2r=08d3526fc19293cf099e0c50dbf3bbc915f2e3f2 http://docs.guava-libraries.googlecode.com/git-history/v14.0/javadoc/com/google/common/io/ByteStreams.html#limit(java.io.InputStream, long) It would be nice to reuse but you can see the version problem there. At least, maybe something that can be lifted and adapted. Create LengthBoundedInputStream --- Key: SPARK-3442 URL: https://issues.apache.org/jira/browse/SPARK-3442 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin To create a LengthBoundedInputStream, which is an InputStream decorator that limits the number of bytes returned from an underlying InputStream. This can be used to create an InputStream directly from a segment of a file (FileInputStream always reads till EOF). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
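The decorator itself is small enough to sketch directly if reusing Guava turns out to be awkward (a rough sketch of the idea only, not the Guava implementation or the eventual Spark class):
{code}
import java.io.{FilterInputStream, InputStream}

// Serves at most `limit` bytes from the underlying stream, then reports end-of-stream.
class LengthBoundedInputStream(in: InputStream, limit: Long) extends FilterInputStream(in) {
  private var remaining = limit

  override def read(): Int = {
    if (remaining <= 0) -1
    else {
      val b = in.read()
      if (b >= 0) remaining -= 1
      b
    }
  }

  override def read(buf: Array[Byte], off: Int, len: Int): Int = {
    if (remaining <= 0) -1
    else {
      val n = in.read(buf, off, math.min(len.toLong, remaining).toInt)
      if (n > 0) remaining -= n
      n
    }
  }
}
{code}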
[jira] [Resolved] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3404. -- Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Tests are now failing due to HiveQL test problems, but you can see they have passed SparkSubmitSuite: https://amplab.cs.berkeley.edu/jenkins/view/Spark/ I think this one's resolved now. SparkSubmitSuite fails with spark-submit exits with code 1 Key: SPARK-3404 URL: https://issues.apache.org/jira/browse/SPARK-3404 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2, 1.1.0 Reporter: Sean Owen Priority: Critical Fix For: 1.1.1, 1.2.0 Maven-based Jenkins builds have been failing for over a month. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ It's SparkSubmitSuite that fails. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull {code} SparkSubmitSuite ... - launch simple application with spark-submit *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... - spark submit includes jars passed in through --jar *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, local-cluster[2,1,512], --jars, file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... {code} SBT builds don't fail, so it is likely to be due to some difference in how the tests are run rather than a problem with test or core project. This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the cause identified in that JIRA is, at least, not the only cause. 
(Although, it wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins config to invoke {{mvn clean mvn ... package}} {{mvn ... clean package}}.) This JIRA tracks investigation into a different cause. Right now I have some further information but not a PR yet. Part of the issue is that there is no clue in the log about why {{spark-submit}} exited with status 1. See https://github.com/apache/spark/pull/2108/files and https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at least print stdout to the log too. The SparkSubmit program exits with 1 when the main class it is supposed to run is not found (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) This is for example SimpleApplicationTest (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) The test actually submits an empty JAR not containing this class. It relies on {{spark-submit}} finding the class within the compiled test-classes of the
[jira] [Commented] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable
[ https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128232#comment-14128232 ] Sean Owen commented on SPARK-3470: -- If you implement {{AutoCloseable}}, then Spark will not work on Java 6, since this class does not exist before Java 7. Implementing {{Closeable}} is fine of course. I assume it would just call {{stop()}} Have JavaSparkContext implement Closeable/AutoCloseable --- Key: SPARK-3470 URL: https://issues.apache.org/jira/browse/SPARK-3470 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.2 Reporter: Shay Rojansky Priority: Minor After discussion in SPARK-2972, it seems like a good idea to allow Java developers to use Java 7 automatic resource management with JavaSparkContext, like so: {code:java} try (JavaSparkContext ctx = new JavaSparkContext(...)) { return br.readLine(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
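As a sketch of the Closeable route (an illustration only, not the actual change to JavaSparkContext): close() would simply delegate to stop(), which is enough for Java 7 try-with-resources and for Closeable-based utilities on Java 6.
{code}
import java.io.Closeable
import org.apache.spark.api.java.JavaSparkContext

// A thin wrapper showing the idea; the real change would have JavaSparkContext
// itself implement Closeable with close() delegating to stop().
class CloseableSparkContext(val jsc: JavaSparkContext) extends Closeable {
  override def close(): Unit = jsc.stop()
}
{code}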
[jira] [Commented] (SPARK-3474) Rename the env variable SPARK_MASTER_IP to SPARK_MASTER_HOST
[ https://issues.apache.org/jira/browse/SPARK-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128300#comment-14128300 ] Sean Owen commented on SPARK-3474: -- (You can deprecate but still support old variable names, right? so SPARK_MASTER_IP has the effect of setting new SPARK_MASTER_HOST but generates a warning. You wouldn't want to or need to remove old vars immediately.) Rename the env variable SPARK_MASTER_IP to SPARK_MASTER_HOST Key: SPARK-3474 URL: https://issues.apache.org/jira/browse/SPARK-3474 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.1 Reporter: Chunjun Xiao There's some inconsistency regarding the env variable used to specify the spark master host server. In spark source code (MasterArguments.scala), the env variable is SPARK_MASTER_HOST, while in the shell script (e.g., spark-env.sh, start-master.sh), it's named SPARK_MASTER_IP. This will introduce an issue in some case, e.g., if spark master is started via service spark-master start, which is built based on latest bigtop (refer to bigtop/spark-master.svc). In this case, SPARK_MASTER_IP will have no effect. I suggest we change SPARK_MASTER_IP in the shell script to SPARK_MASTER_HOST. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable
[ https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128764#comment-14128764 ] Sean Owen commented on SPARK-3470: -- Spark retains compatibility with Java 6 on purpose AFAIK. But implementing Closeable is fine and also works with try-with-resources in Java 7, yes. Have JavaSparkContext implement Closeable/AutoCloseable --- Key: SPARK-3470 URL: https://issues.apache.org/jira/browse/SPARK-3470 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.2 Reporter: Shay Rojansky Priority: Minor After discussion in SPARK-2972, it seems like a good idea to allow Java developers to use Java 7 automatic resource management with JavaSparkContext, like so: {code:java} try (JavaSparkContext ctx = new JavaSparkContext(...)) { return br.readLine(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-558) Simplify run script by relying on sbt to launch app
[ https://issues.apache.org/jira/browse/SPARK-558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129798#comment-14129798 ] Sean Owen commented on SPARK-558: - Is this stale too? given that SBT is less used, I don't imagine the run scripts will start relying on it for classpath generation. Simplify run script by relying on sbt to launch app --- Key: SPARK-558 URL: https://issues.apache.org/jira/browse/SPARK-558 Project: Spark Issue Type: Improvement Reporter: Ismael Juma The run script replicates SBT's functionality in order to build the classpath. This could be avoided by creating a task in sbt that is responsible for calling the appropriate main method, configuring the environment variables from the script and then invoking sbt with the task name and arguments. Is there a reason why we should not do this? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-683) Spark 0.7 with Hadoop 1.0 does not work with current AMI's HDFS installation
[ https://issues.apache.org/jira/browse/SPARK-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-683. - Resolution: Fixed I think this is likely long since obsolete or fixed, since Spark, Hadoop and AMI Hadoop versions have moved forward, and have not heard of this issue in recent memory. Spark 0.7 with Hadoop 1.0 does not work with current AMI's HDFS installation Key: SPARK-683 URL: https://issues.apache.org/jira/browse/SPARK-683 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 0.7.0 Reporter: Tathagata Das A simple saveAsObjectFile() leads to the following error. org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NoSuchMethodException: org.apache.hadoop.hdfs.protocol.ClientProtocol.create(java.lang.String, org.apache.hadoop.fs.permission.FsPermission, java.lang.String, boolean, boolean, short, long) at java.lang.Class.getMethod(Class.java:1622) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-880) When built with Hadoop2, spark-shell and examples don't initialize log4j properly
[ https://issues.apache.org/jira/browse/SPARK-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129803#comment-14129803 ] Sean Owen commented on SPARK-880: - This should be resolved/obsoleted by subsequent updates to SLF4J and log4j integration, and the props file. When built with Hadoop2, spark-shell and examples don't initialize log4j properly - Key: SPARK-880 URL: https://issues.apache.org/jira/browse/SPARK-880 Project: Spark Issue Type: Bug Reporter: Matei Zaharia They print this: {code} log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jEventHandler). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. {code} It might have to do with not finding a log4j.properties file. I believe hadoop1 had one in its own JARs (or depended on an older log4j that came with a default)? but hadoop2 doesn't. We should probably have our own default one in conf. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3520) java version check in spark-class fails with openjdk
[ https://issues.apache.org/jira/browse/SPARK-3520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133226#comment-14133226 ] Sean Owen commented on SPARK-3520: -- Duplicated / subsumed by https://issues.apache.org/jira/browse/SPARK-3425 java version check in spark-class fails with openjdk Key: SPARK-3520 URL: https://issues.apache.org/jira/browse/SPARK-3520 Project: Spark Issue Type: Bug Components: Spark Shell Environment: Freebsd 10.1, Openjdk 7 Reporter: Radim Kolar Priority: Minor tested on current git master: (hsn@sanatana:pts/4):spark/bin% ./spark-shell /home/hsn/live/spark/bin/spark-class: line 111: [: openjdk version 1.7.0_65: integer expression expected (hsn@sanatana:pts/4):spark/bin% java -version openjdk version 1.7.0_65 OpenJDK Runtime Environment (build 1.7.0_65-b17) OpenJDK Server VM (build 24.65-b04, mixed mode) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3521) Missing modules in 1.1.0 source distribution - can't be built with Maven
[ https://issues.apache.org/jira/browse/SPARK-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133413#comment-14133413 ] Sean Owen commented on SPARK-3521: -- https://dist.apache.org/repos/dist/release/spark/spark-1.1.0/spark-1.1.0.tgz All of that source code is plainly in the distribution. It compiles with Maven for me, and this was verified by several people during the release. It sounds like something is quite corrupted about your copy. Missing modules in 1.1.0 source distribution - can't be built with Maven --- Key: SPARK-3521 URL: https://issues.apache.org/jira/browse/SPARK-3521 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Radim Kolar Priority: Minor The modules {{bagel}}, {{mllib}}, {{flume-sink}} and {{flume}} are missing from the source code distribution, so Spark can't be built with Maven. It can't be built by {{sbt/sbt}} either, due to another bug (_java.lang.IllegalStateException: impossible to get artifacts when data has not been loaded. IvyNode = org.slf4j#slf4j-api;1.6.1_) (hsn@sanatana:pts/6):work/spark-1.1.0% mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package [INFO] Scanning for projects... [ERROR] The build could not read 1 project - [Help 1] [ERROR] [ERROR] The project org.apache.spark:spark-parent:1.1.0 (/home/hsn/myports/spark11/work/spark-1.1.0/pom.xml) has 4 errors [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/bagel of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/mllib of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/external/flume of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/external/flume-sink/pom.xml of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133760#comment-14133760 ] Sean Owen commented on SPARK-3530: -- A few high-level questions: Is this a rewrite of MLlib? I see the old code will be deprecated. I assume the algorithms will come along, but in a fairly different form. I think that's actually a good thing. But is this targeted at a 2.x release, or sooner? How does this relate to MLI and MLbase? I had thought they would in theory handle things like grid-search, but haven't seen activity or mention of these in a while. Is this at all a merge of the two or is MLlib going to take over these concerns? I don't think you will need or want to use this code, but the oryx project already has an implementation of grid search on Spark. At least another take on the API for such a thing to consider. https://github.com/OryxProject/oryx/tree/master/oryx-ml/src/main/java/com/cloudera/oryx/ml/param Big +1 for parameter tuning. That belongs as a first-class citizen. I'm also intrigued by doing better than trying every possible combination of parameters separately, and maybe sharing partial results to speed up several models' training. Is this realistic for any parameters besides things like # iterations? which isn't really a hyperparam. I don't know, for example, ways to build N models with N different overfitting params and share some work. I would love to know that's possible. Good to design for it anyway. I see mention of a Dataset abstraction, which I'm assuming contains some type information, like distinguishing categorical and numeric features. I think that's very good! I've always found the 'pipeline' part hard to build. It's tempting to construct a framework for feature extraction. To some degree you can by providing transformations, 1-hot encoding, etc. But I think that a framework for understanding arbitrary databases and fields and so on quickly becomes too endlessly large a scope. Spark Core to me is already the right abstraction for upstream ETL of data before entering an ML framework. I mention it just because it's in the first picture, but I don't see discussion of actually doing user/product attribute selection later. So maybe it's not meant to be part of the proposal. I'd certainly like to keep up more with your work here. This is a big step forward in making MLlib more relevant to production deployments rather than just pure algorithms implementations. Pipeline and Parameters --- Key: SPARK-3530 URL: https://issues.apache.org/jira/browse/SPARK-3530 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
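On the grid-search point above, a deliberately naive sketch (illustrative only, and not the API proposed in the design doc): every parameter combination is trained and evaluated independently, which is exactly the baseline the comment hopes the pipeline API can improve on by sharing work across fits.
{code}
// Illustrative only; Params, train and evaluate are placeholders.
case class Params(regParam: Double, numIterations: Int)

def gridSearch[M](
    grid: Seq[Params],
    train: Params => M,
    evaluate: M => Double): (Params, M, Double) =
  grid.map { p =>
    val model = train(p)              // one independent fit per combination
    (p, model, evaluate(model))
  }.maxBy(_._3)                       // keep the best-scoring combination
{code}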
[jira] [Resolved] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable
[ https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3470. -- Resolution: Fixed Fix Version/s: 1.2.0 Have JavaSparkContext implement Closeable/AutoCloseable --- Key: SPARK-3470 URL: https://issues.apache.org/jira/browse/SPARK-3470 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.2 Reporter: Shay Rojansky Priority: Minor Fix For: 1.2.0 After discussion in SPARK-2972, it seems like a good idea to allow Java developers to use Java 7 automatic resource management with JavaSparkContext, like so: {code:java} try (JavaSparkContext ctx = new JavaSparkContext(...)) { return br.readLine(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1895) Run tests on windows
[ https://issues.apache.org/jira/browse/SPARK-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133964#comment-14133964 ] Sean Owen commented on SPARK-1895: -- Can anyone still reproduce this? I know test temp file cleanup was improved in 1.0.x, and am not sure I have heard of this since. Run tests on windows Key: SPARK-1895 URL: https://issues.apache.org/jira/browse/SPARK-1895 Project: Spark Issue Type: Bug Components: PySpark, Windows Affects Versions: 0.9.1 Environment: spark-0.9.1-bin-hadoop1 Reporter: stribog Priority: Trivial bin\pyspark python\pyspark\rdd.py Sometimes tests complete without error _. Last tests fail log: 14/05/21 18:31:40 INFO Executor: Running task ID 321 14/05/21 18:31:40 INFO Executor: Running task ID 324 14/05/21 18:31:40 INFO Executor: Running task ID 322 14/05/21 18:31:40 INFO Executor: Running task ID 323 14/05/21 18:31:40 INFO PythonRDD: Times: total = 241, boot = 240, init = 1, finish = 0 14/05/21 18:31:40 INFO Executor: Serialized size of result for 324 is 607 14/05/21 18:31:40 INFO Executor: Sending result for 324 directly to driver 14/05/21 18:31:40 INFO Executor: Finished task ID 324 14/05/21 18:31:40 INFO TaskSetManager: Finished TID 324 in 248 ms on localhost (progress: 1/4) 14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 3) 14/05/21 18:31:40 INFO PythonRDD: Times: total = 518, boot = 516, init = 2, finish = 0 14/05/21 18:31:40 INFO Executor: Serialized size of result for 323 is 607 14/05/21 18:31:40 INFO Executor: Sending result for 323 directly to driver 14/05/21 18:31:40 INFO Executor: Finished task ID 323 14/05/21 18:31:40 INFO TaskSetManager: Finished TID 323 in 528 ms on localhost (progress: 2/4) 14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 2) 14/05/21 18:31:41 INFO PythonRDD: Times: total = 776, boot = 774, init = 2, finish = 0 14/05/21 18:31:41 INFO Executor: Serialized size of result for 322 is 607 14/05/21 18:31:41 INFO Executor: Sending result for 322 directly to driver 14/05/21 18:31:41 INFO Executor: Finished task ID 322 14/05/21 18:31:41 INFO TaskSetManager: Finished TID 322 in 785 ms on localhost (progress: 3/4) 14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 1) 14/05/21 18:31:41 INFO PythonRDD: Times: total = 1043, boot = 1042, init = 1, finish = 0 14/05/21 18:31:41 INFO Executor: Serialized size of result for 321 is 607 14/05/21 18:31:41 INFO Executor: Sending result for 321 directly to driver 14/05/21 18:31:41 INFO Executor: Finished task ID 321 14/05/21 18:31:41 INFO TaskSetManager: Finished TID 321 in 1049 ms on localhost (progress: 4/4) 14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 0) 14/05/21 18:31:41 INFO TaskSchedulerImpl: Removed TaskSet 80.0, whose tasks have all completed, from pool 14/05/21 18:31:41 INFO DAGScheduler: Stage 80 (top at doctest __main__.RDD.top[0]:1) finished in 1,051 s 14/05/21 18:31:41 INFO SparkContext: Job finished: top at doctest __main__.RDD.top[0]:1, took 1.053832912 s 14/05/21 18:31:41 INFO SparkContext: Starting job: top at doctest __main__.RDD.top[1]:1 14/05/21 18:31:41 INFO DAGScheduler: Got job 63 (top at doctest __main__.RDD.top[1]:1) with 4 output partitions (allowLocal=false) 14/05/21 18:31:41 INFO DAGScheduler: Final stage: Stage 81 (top at doctest __main__.RDD.top[1]:1) 14/05/21 18:31:41 INFO DAGScheduler: Parents of final stage: List() 14/05/21 18:31:41 INFO DAGScheduler: Missing parents: List() 14/05/21 18:31:41 INFO DAGScheduler: Submitting Stage 81 (PythonRDD[213] at top at doctest 
__main__.RDD.top[1]:1), which has no missing parents 14/05/21 18:31:41 INFO DAGScheduler: Submitting 4 missing tasks from Stage 81 (PythonRDD[213] at top at doctest __main__.RDD.top[1]:1) 14/05/21 18:31:41 INFO TaskSchedulerImpl: Adding task set 81.0 with 4 tasks 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:0 as TID 325 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:0 as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:1 as TID 326 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:1 as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:2 as TID 327 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:2 as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:3 as TID 328 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:3 as 2609 bytes in 1 ms 14/05/21 18:31:41 INFO Executor: Running task ID 326 14/05/21 18:31:41 INFO Executor:
[jira] [Resolved] (SPARK-1258) RDD.countByValue optimization
[ https://issues.apache.org/jira/browse/SPARK-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1258. -- Resolution: Won't Fix I'm taking the liberty of closing this, since this refers to an optimization using fastutil classes, which were removed from Spark. An equivalent optimization is employed now, using Spark's OpenHashMap. RDD.countByValue optimization - Key: SPARK-1258 URL: https://issues.apache.org/jira/browse/SPARK-1258 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.0 Reporter: Jaroslav Kamenik Priority: Trivial The class Object2LongOpenHashMap has a method add(key, incr) (addTo in the new version) for incrementing the value assigned to a key. It should be faster than the currently used map.put(v, map.getLong(v) + 1L). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
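For readers without the fastutil context, a plain-Scala sketch of the counting pattern in question (illustrative; per the comment above, Spark's own OpenHashMap now fills the role the fastutil map used to play):
{code}
import scala.collection.mutable

// Count occurrences of each distinct value. The JIRA's point was that a
// single add/increment call per element beats a separate get followed by a
// put; Spark's OpenHashMap-based implementation provides that optimization.
def countByValue[T](values: Iterator[T]): mutable.Map[T, Long] = {
  val counts = mutable.HashMap.empty[T, Long].withDefaultValue(0L)
  values.foreach(v => counts(v) += 1L)
  counts
}
{code}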
[jira] [Commented] (SPARK-3506) 1.1.0-SNAPSHOT in docs for 1.1.0 under docs/latest
[ https://issues.apache.org/jira/browse/SPARK-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133976#comment-14133976 ] Sean Owen commented on SPARK-3506: -- Yeah, I imagine that can be touched up right now. For the future, I imagine the issue was just that the site was built from the branch before the release plugin upped the version and created the artifacts? So the site might be better built from the final released source artifact. I imagine it's a release-process doc change, but I don't know where that lives. 1.1.0-SNAPSHOT in docs for 1.1.0 under docs/latest -- Key: SPARK-3506 URL: https://issues.apache.org/jira/browse/SPARK-3506 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Jacek Laskowski Priority: Trivial In https://spark.apache.org/docs/latest/ there are references to 1.1.0-SNAPSHOT: * This documentation is for Spark version 1.1.0-SNAPSHOT. * For the Scala API, Spark 1.1.0-SNAPSHOT uses Scala 2.10. It should be version 1.1.0 since that's the latest released version and the header says so, too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134096#comment-14134096 ] Sean Owen commented on SPARK-2620: -- FWIW, here is a mailing list comment that suggests 1.1 works with these case classes, although this is not a case where the REPL is being used: http://apache-spark-user-list.1001560.n3.nabble.com/Compiler-issues-for-multiple-map-on-RDD-td14248.html case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0. A minimal example: case class P(name: String) val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect [Spark shell local mode] res: Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast to the expected behavior, that should be equivalent to: sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3563) Shuffle data not always be cleaned
[ https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136899#comment-14136899 ] Sean Owen commented on SPARK-3563: -- I am no expert, but I believe this is on purpose, in order to reuse the shuffle if the RDD partition needs to be recomputed. You may need to set a lower spark.cleaner.ttl? Shuffle data not always be cleaned -- Key: SPARK-3563 URL: https://issues.apache.org/jira/browse/SPARK-3563 Project: Spark Issue Type: Bug Components: core Affects Versions: 1.0.2 Reporter: shenhong In our cluster, when we run a spark streaming job, after running for many hours, the shuffle data seems not all be cleaned, here is the shuffle data: -rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0 -rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1 -rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0 -rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1 -rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0 -rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1 -rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1 -rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0 -rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0 -rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1 -rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1 -rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0 -rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1 The shuffle data of stage 63, 65, 84, 132... are not cleaned. In ContextCleaner, it maintains a weak reference for each RDD, ShuffleDependency, and Broadcast of interest, to be processed when the associated object goes out of scope of the application. Actual cleanup is performed in a separate daemon thread. There must be some reference for ShuffleDependency , and it's hard to find out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
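A minimal sketch of the workaround hinted at in the comment above, using the spark.cleaner.ttl setting available in this Spark line (the value is illustrative; an appropriate TTL depends on the job):
{code}
import org.apache.spark.SparkConf

// Illustrative only: enable Spark's periodic metadata cleanup so old shuffle
// state from a long-running streaming job is eventually dropped.
val conf = new SparkConf()
  .setAppName("long-running-streaming-job")
  .set("spark.cleaner.ttl", "3600")  // seconds; example value only
{code}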
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142383#comment-14142383 ] Sean Owen commented on SPARK-3621: -- My understanding is that this is fairly fundamentally not possible in Spark. The metadata and machinery necessary to operate on RDDs is with the driver. RDDs are not accessible within transformations or actions. I'm interested both in whether that is in fact true, how much of an issue it really is for Hive-on-Spark to use collect + broadcast, and whether these sorts of requirements can be met with join, cogroup, etc. Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access --- Key: SPARK-3621 URL: https://issues.apache.org/jira/browse/SPARK-3621 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Xuefu Zhang In some cases, such as Hive's way of doing map-side join, it would be benefcial to allow client program to broadcast RDDs rather than just variables made of these RDDs. Broadcasting a variable made of RDDs requires all RDD data be collected to the driver and that the variable be shipped to the cluster after being made. It would be more performing if driver just broadcasts the RDDs and uses the corresponding data in jobs (such building hashmaps at executors). Tez has a broadcast edge which can ship data from previous stage to the next stage, which doesn't require driver side processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
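For reference, a rough sketch of the collect-and-broadcast pattern the comment refers to (key/value types and names are illustrative): the small side is materialized at the driver, broadcast to executors, and joined against on the map side without a shuffle.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Illustrative sketch of a map-side join via collect + broadcast.
def broadcastJoin(
    sc: SparkContext,
    small: RDD[(String, Int)],
    large: RDD[(String, String)]): RDD[(String, (String, Int))] = {
  val smallMap = small.collectAsMap()      // pulled back to the driver
  val bc = sc.broadcast(smallMap)          // shipped once to each executor
  large.flatMap { case (k, v) =>
    bc.value.get(k).map(n => (k, (v, n)))  // join without a shuffle
  }
}
{code}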
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143009#comment-14143009 ] Sean Owen commented on SPARK-3621: -- If the data is shipped to the worker node, and the driver is the thing that can marshal the data to be sent, how is it different from a Broadcast variable? the broadcast can be done efficiently with the torrent-based broadcast, for example. Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access --- Key: SPARK-3621 URL: https://issues.apache.org/jira/browse/SPARK-3621 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Xuefu Zhang In some cases, such as Hive's way of doing map-side join, it would be benefcial to allow client program to broadcast RDDs rather than just variables made of these RDDs. Broadcasting a variable made of RDDs requires all RDD data be collected to the driver and that the variable be shipped to the cluster after being made. It would be more performing if driver just broadcasts the RDDs and uses the corresponding data in jobs (such building hashmaps at executors). Tez has a broadcast edge which can ship data from previous stage to the next stage, which doesn't require driver side processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143153#comment-14143153 ] Sean Owen commented on SPARK-3625: -- This prints 1000 both times for me, which is correct. When you say doesn't work, could you please elaborate? different count? exception? what is your environment? In some cases, the RDD.checkpoint does not work --- Key: SPARK-3625 URL: https://issues.apache.org/jira/browse/SPARK-3625 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Guoqiang Li Assignee: Guoqiang Li Priority: Blocker The reproduce code: {code} sc.setCheckpointDir(checkpointDir) val c = sc.parallelize((1 to 1000)) c.count c.checkpoint() c.count {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143342#comment-14143342 ] Sean Owen commented on SPARK-3625: -- It still prints 1000 both times, which is correct. Your assertion is about something different. The assertion fails, but, the behavior you are asserting is not what the javadoc suggests: {quote} Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext.setCheckpointDir() and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation. {quote} This example calls count() before checkpoint(). If you don't, I think you get the expected behavior, since the dependency becomes a CheckpointRDD. This looks like not a bug. In some cases, the RDD.checkpoint does not work --- Key: SPARK-3625 URL: https://issues.apache.org/jira/browse/SPARK-3625 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Guoqiang Li Assignee: Guoqiang Li Priority: Blocker The reproduce code: {code} sc.setCheckpointDir(checkpointDir) val c = sc.parallelize((1 to 1000)).map(_ + 1) c.count val dep = c.dependencies.head.rdd c.checkpoint() c.count assert(dep != c.dependencies.head.rdd) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
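To make the ordering point concrete, a sketch of the usage the quoted javadoc describes, reusing the sc and checkpointDir from the reproduction above:
{code}
sc.setCheckpointDir(checkpointDir)
val c = sc.parallelize(1 to 1000).map(_ + 1)
c.checkpoint()  // mark for checkpointing BEFORE any action runs
c.count         // the first action materializes and writes the checkpoint
// per the comment above, the dependency then becomes a CheckpointRDD,
// so the assertion from the reproduction would hold with this ordering
{code}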
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143724#comment-14143724 ] Sean Owen commented on SPARK-3431: -- It's trivial to configure Maven surefire/failsafe to execute tests in parallel. It can parallelize by class or method, fork or not, control number of concurrent forks as a multiple of cores, etc. For example, it's no problem to make test classes use their own JVM, and not even reuse JVMs if you don't want. The harder part is making the tests play nice with each other on one machine when it comes to shared resources: files and ports, really. I think the tests have had several passes of improvements to reliably use their own temp space, and try to use an unused port, but this is one typical cause of test breakage. It's not yet clear that tests don't clobber each other by trying to use the same default Spark working dir or something. Finally, some tests that depend on a certain sequence of random numbers may need to be made more robust. but the parallelization is trivial in Maven, at least. Parallelize execution of tests -- Key: SPARK-3431 URL: https://issues.apache.org/jira/browse/SPARK-3431 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common strategy to cut test time down is to parallelize the execution of the tests. Doing that may in turn require some prerequisite changes to be made to how certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143835#comment-14143835 ] Sean Owen commented on SPARK-3431: -- For your experiments, scalatest just copies an old subset of surefire's config: http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin vs http://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html You can see discussion of how forkMode works: http://maven.apache.org/surefire/maven-surefire-plugin/examples/fork-options-and-parallel-execution.html Bad news is that scalatest's support is much more limited, but parallel=true and forkMode=once might do the trick. Otherwise... I guess we can figure out if it's realistic to use standard surefire instead of scalatest. Parallelize execution of tests -- Key: SPARK-3431 URL: https://issues.apache.org/jira/browse/SPARK-3431 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common strategy to cut test time down is to parallelize the execution of the tests. Doing that may in turn require some prerequisite changes to be made to how certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3656) IllegalArgumentException when I using sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144588#comment-14144588 ] Sean Owen commented on SPARK-3656: -- Duplicate of https://issues.apache.org/jira/browse/SPARK-3032 This was discussed even today on the mailing list. IllegalArgumentException when I using sort-based shuffle Key: SPARK-3656 URL: https://issues.apache.org/jira/browse/SPARK-3656 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.1.0 Reporter: yangping wu Original Estimate: 8h Remaining Estimate: 8h The code work fine in hash-based shuffle. {code} sc.textFile(file:///export1/spark/zookeeper.out).flatMap(l = l.split( )).map(w=(w,1)).reduceByKey(_ + _).collect {code} But when I test the program using sort-based shuffle,the program encounters an error: {code} scala sc.textFile(file:///export1/spark/zookeeper.out).flatMap(l = l.split( )).map(w=(w,1)).reduceByKey(_ + _).collect org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 1.0 failed 1 times, most recent failure: Lost task 22.0 in stage 1.0 (TID 22, localhost): java.lang.IllegalArgumentException: Comparison method violates its general contract! org.apache.spark.util.collection.Sorter$SortState.mergeHi(Sorter.java:876) org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:495) org.apache.spark.util.collection.Sorter$SortState.mergeForceCollapse(Sorter.java:436) org.apache.spark.util.collection.Sorter$SortState.access$300(Sorter.java:294) org.apache.spark.util.collection.Sorter.sort(Sorter.java:137) org.apache.spark.util.collection.AppendOnlyMap.destructiveSortedIterator(AppendOnlyMap.scala:271) org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:212) org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:67) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:619) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe,
[jira] [Commented] (SPARK-3662) Importing pandas breaks included pi.py example
[ https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146017#comment-14146017 ] Sean Owen commented on SPARK-3662: -- Maybe I miss something, but, does this just mean you can't import pandas entirely? If you're modifying the example, you should import only what you need from pandas. Or, it may be that you need to modify the import random, indeed, to accommodate other modifications you want to make. But what is the problem with the included example? it runs fine without modifications, no? Importing pandas breaks included pi.py example -- Key: SPARK-3662 URL: https://issues.apache.org/jira/browse/SPARK-3662 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.1.0 Environment: Xubuntu 14.04. Yarn cluster running on Ubuntu 12.04. Reporter: Evan Samanas If I add import pandas at the top of the included pi.py example and submit using spark-submit --master yarn-client, I get this stack trace: {code} Traceback (most recent call last): File /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi.py, line 39, in module count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add) File /home/evan/pub_src/spark/python/pyspark/rdd.py, line 759, in reduce vals = self.mapPartitions(func).collect() File /home/evan/pub_src/spark/python/pyspark/rdd.py, line 723, in collect bytesInJava = self._jrdd.collect().iterator() File /home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError14/09/23 15:51:58 INFO TaskSetManager: Lost task 2.3 in stage 0.0 (TID 10) on executor SERVERNAMEREMOVED: org.apache.spark.api.python.PythonException (Traceback (most recent call last): File /yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py, line 75, in main command = pickleSer._read_with_length(infile) File /yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py, line 150, in _read_with_length return self.loads(obj) ImportError: No module named algos {code} The example works fine if I move the statement from random import random from the top and into the function (def f(_)) defined in the example. Near as I can tell, random is getting confused with a function of the same name within pandas.algos. Submitting the same script using --master local works, but gives a distressing amount of random characters to stdout or stderr and messes up my terminal: {code} ... 
@J@J@J@J@J@J@J@J@J@J@J@J@J@JJ@J@J@J@J @J!@J@J#@J$@J%@J@J'@J(@J)@J*@J+@J,@J-@J.@J/@J0@J1@J2@J3@J4@J5@J6@J7@J8@J9@J:@J;@J@J=@J@J?@J@@JA@JB@JC@JD@JE@JF@JG@JH@JI@JJ@JK@JL@JM@JN@JO@JP@JQ@JR@JS@JT@JU@JV@JW@JX@JY@JZ@J[@J\@J]@J^@J_@J`@Ja@Jb@Jc@Jd@Je@Jf@Jg@Jh@Ji@Jj@Jk@Jl@Jm@Jn@Jo@Jp@Jq@Jr@Js@Jt@Ju@Jv@Jw@Jx@Jy@Jz@J{@J|@J}@J~@J@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JJJ�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JAJAJAJAJAJAJAJAAJ AJ AJ AJ AJAJAJAJAJAJAJAJAJAJAJAJAJAJJAJAJAJAJ AJ!AJAJ#AJ$AJ%AJAJ'AJ(AJ)AJ*AJ+AJ,AJ-AJ.AJ/AJ0AJ1AJ2AJ3AJ4AJ5AJ6AJ7AJ8AJ9AJ:AJ;AJAJ=AJAJ?AJ@AJAAJBAJCAJDAJEAJFAJGAJHAJIAJJAJKAJLAJMAJNAJOAJPAJQAJRAJSAJTAJUAJVAJWAJXAJYAJZAJ[AJ\AJ]AJ^AJ_AJ`AJaAJbAJcAJdAJeAJfAJgAJhAJiAJjAJkAJlAJmAJnAJoAJpAJqAJrAJsAJtAJuAJvAJwAJxAJyAJzAJ{AJ|AJ}AJ~AJAJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJJJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�A14/09/23 15:42:09 INFO SparkContext: Job finished: reduce at /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi_sframe.py:38, took 11.276879779 s J�AJ�AJ�AJ�AJ�AJ�AJ�A�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJBJBJBJBJBJBJBJBBJ BJ BJ BJ BJBJBJBJBJBJBJBJBJBJBJBJBJBJJBJBJBJBJ BJ!BJBJ#BJ$BJ%BJBJ'BJ(BJ)BJ*BJ+BJ,BJ-BJ.BJ/BJ0BJ1BJ2BJ3BJ4BJ5BJ6BJ7BJ8BJ9BJ:BJ;BJBJ=BJBJ?BJ@Be. �]qJ#1a. �]qJX4a. �]qJX4a. �]qJ#1a. �]qJX4a. �]qJX4a. �]qJ#1a. �]qJX4a. �]qJX4a. �]qJa. Pi is roughly 3.146136 {code} No idea if that's related, but thought I'd include it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3676) jdk version lead to spark sql test suite error
[ https://issues.apache.org/jira/browse/SPARK-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146040#comment-14146040 ] Sean Owen commented on SPARK-3676: -- (For the interested, I looked it up, since the behavior change sounds surprising. This is in fact a bug in Java 6 that was fixed in Java 7: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4428022 It may even be fixed in later versions of Java 6, but I have a very recent one and it is not.) jdk version lead to spark sql test suite error -- Key: SPARK-3676 URL: https://issues.apache.org/jira/browse/SPARK-3676 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.2.0 System.out.println(1/500d) get different result in diff jdk version jdk 1.6.0(_31) 0.0020 jdk 1.7.0(_05) 0.002 this will lead to spark sql hive test suite failed (replay by set jdk version = 1.6.0_31)--- [info] - division *** FAILED *** [info] Results do not match for division: [info] SELECT 2 / 1, 1 / 2, 1 / 3, 1 / COUNT(*) FROM src LIMIT 1 [info] == Parsed Logical Plan == [info] Limit 1 [info]Project [(2 / 1) AS c_0#692,(1 / 2) AS c_1#693,(1 / 3) AS c_2#694,(1 / COUNT(1)) AS c_3#695] [info] UnresolvedRelation None, src, None [info] [info] == Analyzed Logical Plan == [info] Limit 1 [info]Aggregate [], [(CAST(2, DoubleType) / CAST(1, DoubleType)) AS c_0#692,(CAST(1, DoubleType) / CAST(2, DoubleType)) AS c_1#693,(CAST(1, DoubleType) / CAST(3, DoubleType)) AS c_2#694,(CAST(CAST(1, LongType), Doub leType) / CAST(COUNT(1), DoubleType)) AS c_3#695] [info] MetastoreRelation default, src, None [info] [info] == Optimized Logical Plan == [info] Limit 1 [info]Aggregate [], [2.0 AS c_0#692,0.5 AS c_1#693,0. AS c_2#694,(1.0 / CAST(COUNT(1), DoubleType)) AS c_3#695] [info] Project [] [info] MetastoreRelation default, src, None [info] [info] == Physical Plan == [info] Limit 1 [info]Aggregate false, [], [2.0 AS c_0#692,0.5 AS c_1#693,0. AS c_2#694,(1.0 / CAST(SUM(PartialCount#699L), DoubleType)) AS c_3#695] [info] Exchange SinglePartition [info] Aggregate true, [], [COUNT(1) AS PartialCount#699L] [info] HiveTableScan [], (MetastoreRelation default, src, None), None [info] [info] Code Generation: false [info] == RDD == [info] c_0c_1 c_2 c_3 [info] !== HIVE - 1 row(s) == == CATALYST - 1 row(s) == [info] !2.0 0.5 0. 0.002 2.0 0.5 0. 0.0020 (HiveComparisonTest.scala:370) [info] - timestamp cast #1 *** FAILED *** [info] Results do not match for timestamp cast #1: [info] SELECT CAST(CAST(1 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1 [info] == Parsed Logical Plan == [info] Limit 1 [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995] [info] UnresolvedRelation None, src, None [info] [info] == Analyzed Logical Plan == [info] Limit 1 [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995] [info] MetastoreRelation default, src, None [info] [info] == Optimized Logical Plan == [info] Limit 1 [info]Project [0.0010 AS c_0#995] [info] MetastoreRelation default, src, None [info] [info] == Physical Plan == [info] Limit 1 [info]Project [0.0010 AS c_0#995] [info] HiveTableScan [], (MetastoreRelation default, src, None), None [info] [info] Code Generation: false [info] == RDD == [info] c_0 [info] !== HIVE - 1 row(s) == == CATALYST - 1 row(s) == [info] !0.001 0.0010 (HiveComparisonTest.scala:370) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
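The difference reduces to a one-line reproduction; the outputs noted in the comment are those reported in the issue and the linked JDK bug, not something the snippet itself guarantees:
{code}
object DoubleToStringCheck {
  def main(args: Array[String]): Unit = {
    // Reported output: "0.0020" on the affected Java 6 builds, "0.002" on Java 7
    println(1 / 500d)
  }
}
{code}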
[jira] [Updated] (SPARK-3586) Support nested directories in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3586: - Issue Type: Improvement (was: Bug) Summary: Support nested directories in Spark Streaming (was: spark streaming ) Support nested directories in Spark Streaming - Key: SPARK-3586 URL: https://issues.apache.org/jira/browse/SPARK-3586 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: wangxj Priority: Minor Labels: patch Fix For: 1.1.0 For text files, there is the method streamingContext.textFileStream(dataDirectory): Spark Streaming will monitor the directory dataDirectory and process any files created in that directory, but files written in nested directories are not supported, e.g. streamingContext.textFileStream("/test"). Look at the directory contents: /test/file1 /test/file2 /test/dr/file1 With this method, textFileStream can only read /test/file1, /test/file2 and /test/dr/, but the file /test/dr/file1 is not read. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3603) InvalidClassException on a Linux VM - probably problem with serialization
[ https://issues.apache.org/jira/browse/SPARK-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150700#comment-14150700 ] Sean Owen commented on SPARK-3603: -- Can you clarify what works and doesn't work -- works when both client and master use the same JVM, but, fails if they use different Linux JVMs? I'm wondering if it is actually going to be required to use the same JVMs, or at least, ones that generate serialVersionUID in the same way (which can be JVM-specific). Spark/Scala throw around so much serialized stuff that you notice pretty quickly, and setting a fixed serialVersionUID for the Scala classes is of course not feasible. You could try Kryo serialization too. InvalidClassException on a Linux VM - probably problem with serialization - Key: SPARK-3603 URL: https://issues.apache.org/jira/browse/SPARK-3603 Project: Spark Issue Type: Bug Affects Versions: 1.0.0, 1.1.0 Environment: Linux version 2.6.32-358.32.3.el6.x86_64 (mockbu...@x86-029.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Fri Jan 17 08:42:31 EST 2014 java version 1.7.0_25 OpenJDK Runtime Environment (rhel-2.3.10.4.el6_4-x86_64) OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode) Spark (either 1.0.0 or 1.1.0) Reporter: Tomasz Dudziak Priority: Critical Labels: scala, serialization, spark I have a Scala app connecting to a standalone Spark cluster. It works fine on Windows or on a Linux VM; however, when I try to run the app and the Spark cluster on another Linux VM (the same Linux kernel, Java and Spark - tested for versions 1.0.0 and 1.1.0) I get the below exception. This looks kind of similar to the Big-Endian (IBM Power7) Spark Serialization issue (SPARK-2018), but... my system is definitely little endian and I understand the big endian issue should be already fixed in Spark 1.1.0 anyway. I'd appreaciate your help. 
01:34:53.251 WARN [Result resolver thread-0][TaskSetManager] Lost TID 2 (task 1.0:2) 01:34:53.278 WARN [Result resolver thread-0][TaskSetManager] Loss was due to java.io.InvalidClassException java.io.InvalidClassException: scala.reflect.ClassTag$$anon$1; local class incompatible: stream classdesc serialVersionUID = -4937928798201944954, local class serialVersionUID = -8102093212602380348 at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1769) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1891) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at
[jira] [Resolved] (SPARK-1279) Stage.name return apply at Option.scala:120
[ https://issues.apache.org/jira/browse/SPARK-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1279. -- Resolution: Duplicate Obvious accidental dupe of SPARK-1280 Stage.name return apply at Option.scala:120 -- Key: SPARK-1279 URL: https://issues.apache.org/jira/browse/SPARK-1279 Project: Spark Issue Type: Bug Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2960) Spark executables fail to start via symlinks
[ https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2960. -- Resolution: Duplicate Fix Version/s: (was: 1.0.2) I suggest this be marked a dupe of https://issues.apache.org/jira/browse/SPARK-3482 as the latter appears to be the same and has an open PR. Spark executables fail to start via symlinks Key: SPARK-2960 URL: https://issues.apache.org/jira/browse/SPARK-2960 Project: Spark Issue Type: Bug Reporter: Shay Rojansky Priority: Minor The current scripts (e.g. pyspark) fail to run when they are executed via symlinks. A common Linux scenario would be to have Spark installed somewhere (e.g. /opt) and have a symlink to it in /usr/bin. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3504) KMeans clusterer is slow, can be sped up by 75%
[ https://issues.apache.org/jira/browse/SPARK-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150709#comment-14150709 ] Sean Owen commented on SPARK-3504: -- Is this the same as https://issues.apache.org/jira/browse/SPARK-3424 ? The latter has a pull request, so should this resolve as duplicate in favor of the other? KMeans clusterer is slow, can be sped up by 75% --- Key: SPARK-3504 URL: https://issues.apache.org/jira/browse/SPARK-3504 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.2 Reporter: Derrick Burns The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because recomputes all distances to all cluster centers on each iteration. In later iterations of Lloyd's algorithm, points don't change clusters and clusters don't move. By 1) tracking which clusters move and 2) tracking for each point which cluster it belongs to and the distance to that cluster, one can avoid recomputing distances in many cases with very little increase in memory requirements. I implemented this new algorithm and the results were fantastic. Using 16 c3.8xlarge machines on EC2, the clusterer converged in 13 iterations on 1,714,654 (182 dimensional) points and 20,000 clusters in 24 minutes. Here are the running times for the first 7 rounds: 6 minutes and 42 second 7 minutes and 7 seconds 7 minutes 13 seconds 1 minutes 18 seconds 30 seconds 18 seconds 12 seconds Without this improvement, all rounds would have taken roughly 7 minutes, resulting in Lloyd's iterations taking 7 * 13 = 91 minutes. In other words, this improvement resulting in a reduction of roughly 75% in running time with no loss of accuracy. My implementation is a rewrite of the existing 1.0.2 implementation. It is not a simple modification of the existing implementation. Please let me know if you are interested in this new implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3324) YARN module has nonstandard structure which cause compile error In IntelliJ
[ https://issues.apache.org/jira/browse/SPARK-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3324. -- Resolution: Won't Fix I propose considering this WontFix, as it will go away when yarn-alpha goes away anyway, and that seems to be coming in the medium term. YARN module has nonstandard structure which cause compile error In IntelliJ --- Key: SPARK-3324 URL: https://issues.apache.org/jira/browse/SPARK-3324 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Environment: Mac OS: 10.9.4 IntelliJ IDEA: 13.1.4 Scala Plugins: 0.41.2 Maven: 3.0.5 Reporter: Yi Tian Assignee: Patrick Wendell Priority: Minor Labels: intellij, maven, yarn The YARN module has nonstandard path structure like: {code} ${SPARK_HOME} |--yarn |--alpha (contains yarn api support for 0.23 and 2.0.x) |--stable (contains yarn api support for 2.2 and later) | |--pom.xml (spark-yarn) |--common (Common codes not depending on specific version of Hadoop) |--pom.xml (yarn-parent) {code} When we use maven to compile yarn module, maven will import 'alpha' or 'stable' module according to profile setting. And the submodule like 'stable' use the build propertie defined in yarn/pom.xml to import common codes to sourcePath. It will cause IntelliJ can't directly recognize sources in common directory as sourcePath. I thought we should change the yarn module to a unified maven jar project, and add specify different version of yarn api via maven profile setting. It will resolve the compile error in IntelliJ and make the yarn module more simple and clear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3195) Can you add some statistics to do logistic regression better in mllib?
[ https://issues.apache.org/jira/browse/SPARK-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3195: - Priority: Minor (was: Major) Target Version/s: (was: 1.3.0) Fix Version/s: (was: 1.3.0) Labels: (was: test) This sounds like a question more than a JIRA, and questions are better discussed on the mailing list. Can you clarify what you want, and propose a PR, or else close? Can you add some statistics to do logistic regression better in mllib? -- Key: SPARK-3195 URL: https://issues.apache.org/jira/browse/SPARK-3195 Project: Spark Issue Type: New Feature Components: MLlib Reporter: miumiu Priority: Minor Original Estimate: 1m Remaining Estimate: 1m Hi, in logistic regression practice, tests of the regression coefficients and of overall model fit are very important. Can you add effective support for these aspects? For example, the likelihood ratio test or the Wald test is often used to test coefficients, and the Hosmer-Lemeshow test is used to evaluate model fit. We have ROC and Precision-Recall already, but can you also provide the KS statistic, which is widely used for model evaluation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1046) Enable to build behind a proxy.
[ https://issues.apache.org/jira/browse/SPARK-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1046. -- Resolution: Fixed Enable to build behind a proxy. --- Key: SPARK-1046 URL: https://issues.apache.org/jira/browse/SPARK-1046 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.8.1 Reporter: Kousuke Saruta Priority: Minor I tried to build spark-0.8.1 behind a proxy and failed, although I set http/https.proxyHost, proxyPort, proxyUser and proxyPassword. I found it's caused by accessing GitHub using the git protocol (git://). The URL is hard-coded in SparkPluginBuild.scala as follows.
{code}
lazy val junitXmlListener = uri("git://github.com/ijuma/junit_xml_listener.git#fe434773255b451a38e8d889536ebc260f4225ce")
{code}
After I rewrote the URL as follows, I could build successfully.
{code}
lazy val junitXmlListener = uri("https://github.com/ijuma/junit_xml_listener.git#fe434773255b451a38e8d889536ebc260f4225ce")
{code}
I think we should be able to build whether we are behind a proxy or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3101) Missing volatile annotation in ApplicationMaster
[ https://issues.apache.org/jira/browse/SPARK-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3101. -- Resolution: Fixed This was actually subsumed by the commit for SPARK-2933 Missing volatile annotation in ApplicationMaster Key: SPARK-3101 URL: https://issues.apache.org/jira/browse/SPARK-3101 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta In ApplicationMaster, a field variable 'isLastAMRetry' is used as a flag but it's not declared as volatile though it's used from multiple threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
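The fix amounts to a one-line annotation; a minimal sketch, with the field name taken from the report and the surrounding class reduced to a stub:
{code}
class ApplicationMaster /* ... */ {
  // Without @volatile, a write from one thread may never become visible to
  // other threads reading the flag; @volatile maps to a Java volatile field.
  @volatile private var isLastAMRetry: Boolean = true
}
{code}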
[jira] [Resolved] (SPARK-2815) Compilation failed upon the hadoop version 2.0.0-cdh4.5.0
[ https://issues.apache.org/jira/browse/SPARK-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2815. -- Resolution: Won't Fix Target Version/s: (was: 1.1.0) This looks like a WontFix, given the discussion. Compilation failed upon the hadoop version 2.0.0-cdh4.5.0 - Key: SPARK-2815 URL: https://issues.apache.org/jira/browse/SPARK-2815 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: pengyanhong Assignee: Guoqiang Li compile fail via SPARK_HADOOP_VERSION=2.0.0-cdh4.5.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt assembly, finally get error message : [error] (yarn-stable/compile:compile) Compilation failed, the following is the detail error on console: [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:26: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.YarnClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:40: not found: value YarnClient [error] val yarnClient = YarnClient.createYarnClient [error]^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:32: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:33: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:36: object util is not a member of package org.apache.hadoop.yarn.webapp [error] import org.apache.hadoop.yarn.webapp.util.WebAppUtils [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:64: value RM_AM_MAX_ATTEMPTS is not a member of object org.apache.hadoop.yarn.conf.YarnConfiguration [error] YarnConfiguration.RM_AM_MAX_ATTEMPTS, YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS) [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:66: not found: type AMRMClient [error] private var amClient: AMRMClient[ContainerRequest] = _ [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:92: not found: value AMRMClient [error] amClient = AMRMClient.createAMRMClient() [error]^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:137: not found: value WebAppUtils [error] val proxy = WebAppUtils.getProxyHostAndPort(conf) [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:40: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:618: not found: type AMRMClient [error] amClient: AMRMClient[ContainerRequest], [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:596: not 
found: type AMRMClient [error] amClient: AMRMClient[ContainerRequest], [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:577: not found: type AMRMClient [error] amClient: AMRMClient[ContainerRequest], [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:410: value CONTAINER_ID is not a member of object org.apache.hadoop.yarn.api.ApplicationConstants.Environment [error] val containerIdString = System.getenv(ApplicationConstants.Environment.CONTAINER_ID.name()) [error] ^ [error]
[jira] [Commented] (SPARK-3652) upgrade spark sql hive version to 0.13.1
[ https://issues.apache.org/jira/browse/SPARK-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150719#comment-14150719 ] Sean Owen commented on SPARK-3652: -- Yes, there is a lot to updating to 0.13 as you can see in the other JIRA, and the other seems like a cleaner approach. But even that may not be committed. I suggest resolving this one as a duplicate and trying to help the effort in SPARK-2706 instead, which is being used as a patch by some already evidently. upgrade spark sql hive version to 0.13.1 Key: SPARK-3652 URL: https://issues.apache.org/jira/browse/SPARK-3652 Project: Spark Issue Type: Dependency upgrade Components: SQL Affects Versions: 1.1.0 Reporter: wangfei now spark sql hive version is 0.12.0, compile with 0.13.1 will get errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2655) Change the default logging level to WARN
[ https://issues.apache.org/jira/browse/SPARK-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150740#comment-14150740 ] Sean Owen commented on SPARK-2655: -- Given the PR discussion, this is a WontFix? Change the default logging level to WARN Key: SPARK-2655 URL: https://issues.apache.org/jira/browse/SPARK-2655 Project: Spark Issue Type: Improvement Reporter: Davies Liu The current logging level INFO is pretty noisy; reducing this unnecessary logging would provide a better experience for users. Spark is much more stable and mature than before, so users do not need that much logging in normal cases. But some high-level information is still helpful, such as messages about job and task progress; we could change this important logging to WARN level as a hack, otherwise we would need to change all other logging to DEBUG. PS: it would be better to have a one-line progress indicator in the terminal (also in the title). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
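As a user-side workaround, independent of what the default becomes, the root log4j level can be raised before the SparkContext is created; a minimal sketch assuming the log4j 1.x API that Spark 1.x ships with:
{code}
import org.apache.log4j.{Level, Logger}

// Silence Spark's INFO chatter for this application only; WARN and above still appear.
Logger.getRootLogger.setLevel(Level.WARN)
{code}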
[jira] [Commented] (SPARK-2331) SparkContext.emptyRDD has wrong return type
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150746#comment-14150746 ] Sean Owen commented on SPARK-2331: -- Yes your analysis is right-er, I think. I imagine this won't be changed as it introduces an API change, but yes, I think the return type should have been {{RDD[T]}}. As a workaround you can do...
{code}
val empty: RDD[String] = sc.emptyRDD
val rdds = Seq(a, b, c).foldLeft(empty) { (rdd, path) => rdd.union(sc.textFile(path)) }
{code}
Anyone else? WontFix, at least for 1.x? SparkContext.emptyRDD has wrong return type --- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Ian Hummel The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder): val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2247) Data frame (or Pandas) like API for structured data
[ https://issues.apache.org/jira/browse/SPARK-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150748#comment-14150748 ] Sean Owen commented on SPARK-2247: -- For pandas, is this what Sparkling Pandas provides? https://github.com/holdenk/sparklingpandas And for R, is this covered by SparkR? http://amplab-extras.github.io/SparkR-pkg/ Is this something that should therefore live outside Spark? Data frame (or Pandas) like API for structured data --- Key: SPARK-2247 URL: https://issues.apache.org/jira/browse/SPARK-2247 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core, SQL Affects Versions: 1.0.0 Reporter: venu k tangirala Labels: features It would be nice to have R- or Python-pandas-like data frames on Spark: 1) to be able to access the RDD data frame from Python with pandas, 2) to be able to access the RDD data frame from R, 3) to be able to access the RDD data frame from Scala's Saddle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2153) CassandraTest fails for newer Cassandra due to case insensitive key space
[ https://issues.apache.org/jira/browse/SPARK-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2153: - Summary: CassandraTest fails for newer Cassandra due to case insensitive key space (was: Spark Examples) CassandraTest fails for newer Cassandra due to case insensitive key space - Key: SPARK-2153 URL: https://issues.apache.org/jira/browse/SPARK-2153 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.0.0 Reporter: vishnu Priority: Minor Labels: examples Fix For: 1.0.0 Original Estimate: 12h Remaining Estimate: 12h The Spark example CassandraTest.scala cannot be built against newer versions of Cassandra. I tried it on Cassandra 2.0.8. This is because Cassandra is case sensitive about keyspaces and stores all keyspace names in lowercase, while in the example the keyspace is casDemo, so the program fails with an error stating that the keyspace was not found. The new Cassandra jars also no longer have org.apache.cassandra.db.IColumn, so we have to use org.apache.cassandra.db.Column instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1895) Run tests on windows
[ https://issues.apache.org/jira/browse/SPARK-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1895. -- Resolution: Cannot Reproduce Run tests on windows Key: SPARK-1895 URL: https://issues.apache.org/jira/browse/SPARK-1895 Project: Spark Issue Type: Bug Components: PySpark, Windows Affects Versions: 0.9.1 Environment: spark-0.9.1-bin-hadoop1 Reporter: stribog Priority: Trivial bin\pyspark python\pyspark\rdd.py Sometimes tests complete without error _. Last tests fail log: {noformat} 14/05/21 18:31:40 INFO Executor: Running task ID 321 14/05/21 18:31:40 INFO Executor: Running task ID 324 14/05/21 18:31:40 INFO Executor: Running task ID 322 14/05/21 18:31:40 INFO Executor: Running task ID 323 14/05/21 18:31:40 INFO PythonRDD: Times: total = 241, boot = 240, init = 1, finish = 0 14/05/21 18:31:40 INFO Executor: Serialized size of result for 324 is 607 14/05/21 18:31:40 INFO Executor: Sending result for 324 directly to driver 14/05/21 18:31:40 INFO Executor: Finished task ID 324 14/05/21 18:31:40 INFO TaskSetManager: Finished TID 324 in 248 ms on localhost (progress: 1/4) 14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 3) 14/05/21 18:31:40 INFO PythonRDD: Times: total = 518, boot = 516, init = 2, finish = 0 14/05/21 18:31:40 INFO Executor: Serialized size of result for 323 is 607 14/05/21 18:31:40 INFO Executor: Sending result for 323 directly to driver 14/05/21 18:31:40 INFO Executor: Finished task ID 323 14/05/21 18:31:40 INFO TaskSetManager: Finished TID 323 in 528 ms on localhost (progress: 2/4) 14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 2) 14/05/21 18:31:41 INFO PythonRDD: Times: total = 776, boot = 774, init = 2, finish = 0 14/05/21 18:31:41 INFO Executor: Serialized size of result for 322 is 607 14/05/21 18:31:41 INFO Executor: Sending result for 322 directly to driver 14/05/21 18:31:41 INFO Executor: Finished task ID 322 14/05/21 18:31:41 INFO TaskSetManager: Finished TID 322 in 785 ms on localhost (progress: 3/4) 14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 1) 14/05/21 18:31:41 INFO PythonRDD: Times: total = 1043, boot = 1042, init = 1, finish = 0 14/05/21 18:31:41 INFO Executor: Serialized size of result for 321 is 607 14/05/21 18:31:41 INFO Executor: Sending result for 321 directly to driver 14/05/21 18:31:41 INFO Executor: Finished task ID 321 14/05/21 18:31:41 INFO TaskSetManager: Finished TID 321 in 1049 ms on localhost (progress: 4/4) 14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 0) 14/05/21 18:31:41 INFO TaskSchedulerImpl: Removed TaskSet 80.0, whose tasks have all completed, from pool 14/05/21 18:31:41 INFO DAGScheduler: Stage 80 (top at doctest __main__.RDD.top[0]:1) finished in 1,051 s 14/05/21 18:31:41 INFO SparkContext: Job finished: top at doctest __main__.RDD.top[0]:1, took 1.053832912 s 14/05/21 18:31:41 INFO SparkContext: Starting job: top at doctest __main__.RDD.top[1]:1 14/05/21 18:31:41 INFO DAGScheduler: Got job 63 (top at doctest __main__.RDD.top[1]:1) with 4 output partitions (allowLocal=false) 14/05/21 18:31:41 INFO DAGScheduler: Final stage: Stage 81 (top at doctest __main__.RDD.top[1]:1) 14/05/21 18:31:41 INFO DAGScheduler: Parents of final stage: List() 14/05/21 18:31:41 INFO DAGScheduler: Missing parents: List() 14/05/21 18:31:41 INFO DAGScheduler: Submitting Stage 81 (PythonRDD[213] at top at doctest __main__.RDD.top[1]:1), which has no missing parents 14/05/21 18:31:41 INFO DAGScheduler: Submitting 4 missing tasks from Stage 81 
(PythonRDD[213] at top at doctest __main__.RDD.top[1]:1) 14/05/21 18:31:41 INFO TaskSchedulerImpl: Adding task set 81.0 with 4 tasks 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:0 as TID 325 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:0 as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:1 as TID 326 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:1 as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:2 as TID 327 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:2 as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:3 as TID 328 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:3 as 2609 bytes in 1 ms 14/05/21 18:31:41 INFO Executor: Running task ID 326 14/05/21 18:31:41 INFO Executor: Running task ID 328 14/05/21 18:31:41 INFO Executor: Running task ID 327 14/05/21 18:31:41 INFO Executor: Running task ID 325 14/05/21
[jira] [Commented] (SPARK-2517) Remove as many compilation warning messages as possible
[ https://issues.apache.org/jira/browse/SPARK-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150753#comment-14150753 ] Sean Owen commented on SPARK-2517: -- [~rxin] I think you resolved this? I don't see these warnings anymore. (Hurrah.) Remove as many compilation warning messages as possible --- Key: SPARK-2517 URL: https://issues.apache.org/jira/browse/SPARK-2517 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Yin Huai Priority: Minor We should probably treat warnings as failures in Jenkins. Some examples: {code} [warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala:138: abstract type ExpectedAppender is unchecked since it is eliminated by erasure [warn] assert(appender.isInstanceOf[ExpectedAppender]) [warn] ^ [warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala:143: abstract type ExpectedRollingPolicy is unchecked since it is eliminated by erasure [warn] rollingPolicy.isInstanceOf[ExpectedRollingPolicy] [warn] ^ {code} {code} [warn] /scratch/rxin/spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:386: method connect in class IOManager is deprecated: use the new implementation in package akka.io instead [warn] override def preStart = IOManager(context.system).connect(new InetSocketAddress(port)) [warn] ^ [warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:207: non-variable type argument String in type pattern Map[String,Any] is unchecked since it is eliminated by erasure [warn] case (key: String, struct: Map[String, Any]) = { [warn] ^ [warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:238: non-variable type argument String in type pattern java.util.Map[String,Object] is unchecked since it is eliminated by erasure [warn] case map: java.util.Map[String, Object] = [warn] ^ [warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:243: non-variable type argument Object in type pattern java.util.List[Object] is unchecked since it is eliminated by erasure [warn] case list: java.util.List[Object] = [warn] ^ [warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:323: non-variable type argument String in type pattern Map[String,Any] is unchecked since it is eliminated by erasure [warn] case value: Map[String, Any] = toJsonObjectString(value) [warn] ^ [info] Compiling 2 Scala sources to /scratch/rxin/spark/repl/target/scala-2.10/test-classes... 
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:382: method mapWith in class RDD is deprecated: use mapPartitionsWithIndex [warn] val randoms = ones.mapWith( [warn]^ [warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:400: method flatMapWith in class RDD is deprecated: use mapPartitionsWithIndex and flatMap [warn] val randoms = ones.flatMapWith( [warn]^ [warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:421: method filterWith in class RDD is deprecated: use mapPartitionsWithIndex and filter [warn] val sample = ints.filterWith( [warn] ^ [warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala:76: method mapWith in class RDD is deprecated: use mapPartitionsWithIndex [warn] x.mapWith(x = x.toString)((x,y)=x + uc.op(y)) [warn] ^ [warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala:82: method filterWith in class RDD is deprecated: use mapPartitionsWithIndex and filter [warn] x.filterWith(x = x.toString)((x,y)=uc.pred(y)) [warn] ^ [warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/VectorSuite.scala:29: class Vector in package util is deprecated: Use Vectors.dense from Spark's mllib.linalg package instead. [warn] def verifyVector(vector: Vector, expectedLength: Int) = { [warn]^ [warn] one warning found {code} {code} [warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:238: non-variable type argument String in type pattern java.util.Map[String,Object] is unchecked since it is eliminated by
[jira] [Commented] (SPARK-3359) `sbt/sbt unidoc` doesn't work with Java 8
[ https://issues.apache.org/jira/browse/SPARK-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150778#comment-14150778 ] Sean Owen commented on SPARK-3359: -- Yeah I noticed this. The problem is that {{sbt-unidoc}} uses {{genjavadoc}}, and it looks like it generates invalid Java like the snippet you quote (top-level classes can't be private). That's almost all of the extra warnings. It seems to have been fixed in {{genjavadoc}} 0.8: https://github.com/typesafehub/genjavadoc/blob/v0.8/plugin/src/main/scala/com/typesafe/genjavadoc/AST.scala#L107 I can see how to update the plugin in the Maven build, but not yet in the SBT build. If someone who gets SBT can explain how to set {{unidocGenjavadocVersion}} to 0.8 in the {{genjavadocSettings}} that is inherited in {{project/SparkBuild.scala}}, I bet that would fix it. https://github.com/sbt/sbt-unidoc/blob/master/src/main/scala/sbtunidoc/Plugin.scala#L22 `sbt/sbt unidoc` doesn't work with Java 8 - Key: SPARK-3359 URL: https://issues.apache.org/jira/browse/SPARK-3359 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Xiangrui Meng Priority: Minor It seems that Java 8 is stricter on JavaDoc. I got many error messages like {code} [error] /Users/meng/src/spark-mengxr/core/target/java/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.java:2: error: modifier private not allowed here [error] private abstract interface SparkHadoopMapRedUtil { [error] ^ {code} This is minor because we can always use Java 6/7 to generate the doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
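Something along these lines in {{project/SparkBuild.scala}} might work -- a hedged sketch only, assuming sbt-unidoc exposes {{unidocGenjavadocVersion}} as a settable key alongside {{genjavadocSettings}} as in the source linked above:
{code}
// In project/SparkBuild.scala: reuse the plugin's settings but pin genjavadoc to 0.8.
import sbtunidoc.Plugin._
import sbtunidoc.Plugin.UnidocKeys._

lazy val javadocSettings = genjavadocSettings ++ Seq(
  unidocGenjavadocVersion := "0.8"
)
{code}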
[jira] [Commented] (SPARK-3714) Spark workflow scheduler
[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151422#comment-14151422 ] Sean Owen commented on SPARK-3714: -- Another meta-question for everyone: at what point should a project like this simply be a separate add-on project? For example, Oozie is a stand-alone project. Not everything needs to happen directly under the Spark umbrella, which is already broad. One upside to including it is that it perhaps gets more attention. Spark is then forced to maintain it and keep it compatible, which is also a downside I suppose. There is also the effect that you create an official workflow engine and discourage others. I am more asking the question than suggesting an answer, but my reaction was that this could live outside Spark just fine. Spark workflow scheduler Key: SPARK-3714 URL: https://issues.apache.org/jira/browse/SPARK-3714 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Egor Pakhomov Priority: Minor [Design doc | https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing] The Spark stack is currently hard to use in production processes due to the lack of the following features: * scheduling Spark jobs * retrying a failed Spark job in a big pipeline * sharing a context among jobs in a pipeline * queueing jobs. A typical use case for such a platform would be: wait for new data, process the new data, learn ML models on it, compare the new model with the previous one, and on success overwrite the current production model in its HDFS directory with the new one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3274) Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream
[ https://issues.apache.org/jira/browse/SPARK-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151432#comment-14151432 ] Sean Owen commented on SPARK-3274: -- I don't think that's the same thing. It is just saying you are reading a {{SequenceFile}} of {{Text}} and then pretending they are Strings. Are you sure the first {{return}} statement works? They will both work as expected if you just call {{.toString()}} on the {{Text}} objects you are actually operating on. Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream -- Key: SPARK-3274 URL: https://issues.apache.org/jira/browse/SPARK-3274 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2 Reporter: Jack Hu Reproduce code:
{code}
scontext
  .socketTextStream("localhost", 1)
  .mapToPair(new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String arg0) throws Exception {
      return new Tuple2<String, String>("1", arg0);
    }
  })
  .foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() {
    public Void call(JavaPairRDD<String, String> v1, Time v2) throws Exception {
      System.out.println(v2.toString() + ": " + v1.collectAsMap().toString());
      return null;
    }
  });
{code}
Exception: java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lscala.Tuple2; at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:447) at org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:464) at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:90) at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:88) at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282) at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
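In Scala terms, the {{.toString()}} advice above looks roughly like this -- a hedged sketch, with {{path}} as a placeholder for an actual SequenceFile location:
{code}
import org.apache.hadoop.io.Text

// Convert the mutable, reused Text objects to plain Strings as soon as they are
// read, before any collect/collectAsMap, so later code really works with Strings.
val strings = sc.sequenceFile(path, classOf[Text], classOf[Text])
  .map { case (k, v) => (k.toString, v.toString) }
{code}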
[jira] [Resolved] (SPARK-2159) Spark shell exit() does not stop SparkContext
[ https://issues.apache.org/jira/browse/SPARK-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2159. -- Resolution: Won't Fix Fix Version/s: (was: 1.2.0) The discussion in the PR suggests this is WontFix. https://github.com/apache/spark/pull/1230#issuecomment-54045637 Spark shell exit() does not stop SparkContext - Key: SPARK-2159 URL: https://issues.apache.org/jira/browse/SPARK-2159 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Priority: Minor If you type exit() in spark shell, it is equivalent to a Ctrl+C and does not stop the SparkContext. This is used very commonly to exit a shell, and it would be good if it is equivalent to Ctrl+D instead, which does stop the SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2643) Stages web ui has ERROR when pool name is None
[ https://issues.apache.org/jira/browse/SPARK-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2643. -- Resolution: Fixed Discussion suggests this was fixed by a related change: https://github.com/apache/spark/pull/1854#issuecomment-55061571 Stages web ui has ERROR when pool name is None -- Key: SPARK-2643 URL: https://issues.apache.org/jira/browse/SPARK-2643 Project: Spark Issue Type: Bug Components: Web UI Reporter: YanTang Zhai Priority: Minor 14/07/23 16:01:44 WARN servlet.ServletHandler: /stages/ java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:313) at scala.None$.get(Option.scala:311) at org.apache.spark.ui.jobs.StageTableBase.stageRow(StageTable.scala:132) at org.apache.spark.ui.jobs.StageTableBase.org$apache$spark$ui$jobs$StageTableBase$$renderStageRow(StageTable.scala:150) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:52) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:52) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:61) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:61) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077) at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980) at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980) at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969) at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969) at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.xml.NodeBuffer.$amp$plus(NodeBuffer.scala:38) at scala.xml.NodeBuffer.$amp$plus(NodeBuffer.scala:40) at org.apache.spark.ui.jobs.StageTableBase.stageTable(StageTable.scala:60) at org.apache.spark.ui.jobs.StageTableBase.toNodeSeq(StageTable.scala:52) at org.apache.spark.ui.jobs.JobProgressPage.render(JobProgressPage.scala:91) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:65) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:65) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:70) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at
[jira] [Resolved] (SPARK-1208) after some hours of working the :4040 monitoring UI stops working.
[ https://issues.apache.org/jira/browse/SPARK-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1208. -- Resolution: Fixed This appears to be a similar, if not the same issue, as in SPARK-2643. The discussion in the PR indicates this was resolved by a subsequent change: https://github.com/apache/spark/pull/1854#issuecomment-55061571 after some hours of working the :4040 monitoring UI stops working. -- Key: SPARK-1208 URL: https://issues.apache.org/jira/browse/SPARK-1208 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 0.9.0 Reporter: Tal Sliwowicz This issue is inconsistent, but it did not exist in prior versions. The Driver app otherwise works normally. The log file below is from the driver. 2014-03-09 07:24:55,837 WARN [qtp1187052686-17453] AbstractHttpConnection - /stages/ java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:313) at scala.None$.get(Option.scala:311) at org.apache.spark.ui.jobs.StageTable.org$apache$spark$ui$jobs$StageTable$$stageRow(StageTable.scala:114) at org.apache.spark.ui.jobs.StageTable$$anonfun$toNodeSeq$1.apply(StageTable.scala:39) at org.apache.spark.ui.jobs.StageTable$$anonfun$toNodeSeq$1.apply(StageTable.scala:39) at org.apache.spark.ui.jobs.StageTable$$anonfun$stageTable$1.apply(StageTable.scala:57) at org.apache.spark.ui.jobs.StageTable$$anonfun$stageTable$1.apply(StageTable.scala:57) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.ui.jobs.StageTable.stageTable(StageTable.scala:57) at org.apache.spark.ui.jobs.StageTable.toNodeSeq(StageTable.scala:39) at org.apache.spark.ui.jobs.IndexPage.render(IndexPage.scala:81) at org.apache.spark.ui.jobs.JobProgressUI$$anonfun$getHandlers$3.apply(JobProgressUI.scala:59) at org.apache.spark.ui.jobs.JobProgressUI$$anonfun$getHandlers$3.apply(JobProgressUI.scala:59) at org.apache.spark.ui.JettyUtils$$anon$1.handle(JettyUtils.scala:61) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1040) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:976) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:363) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:483) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:920) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:982) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:628) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:662) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3203) ClassNotFoundException in spark-shell with Cassandra
[ https://issues.apache.org/jira/browse/SPARK-3203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3203: - Summary: ClassNotFoundException in spark-shell with Cassandra (was: ClassNotFound Exception) ClassNotFoundException in spark-shell with Cassandra Key: SPARK-3203 URL: https://issues.apache.org/jira/browse/SPARK-3203 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Environment: Ubuntu 12.04, openjdk 64 bit 7u65 Reporter: Rohit Kumar I am using Spark with as processing engine over cassandra. I have only one master and a worker node. I am executing following code in spark-shell : sc.stop import org.apache.spark.SparkContext import org.apache.spark.SparkConf import com.datastax.spark.connector._ val conf = new SparkConf(true).set(spark.cassandra.connection.host, 127.0.0.1) val sc = new SparkContext(spark://L-BXP44Z1:7077, Cassandra Connector Test, conf) val rdd = sc.cassandraTable(test, kv) println(rdd.map(_.getInt(value)).sum) I am getting following error: 14/08/25 18:47:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/08/25 18:49:39 INFO CoarseGrainedExecutorBackend: Got assigned task 0 14/08/25 18:49:39 INFO Executor: Running task ID 0 14/08/25 18:49:39 ERROR Executor: Exception in task ID 0 java.lang.ClassNotFoundException: $line29.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at
[jira] [Updated] (SPARK-1381) Spark to Shark direct streaming
[ https://issues.apache.org/jira/browse/SPARK-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1381: - Priority: Major (was: Blocker) It sounds like this is WontFix at this point, if there was a problem to begin with, as Shark is deprecated. Spark to Shark direct streaming --- Key: SPARK-1381 URL: https://issues.apache.org/jira/browse/SPARK-1381 Project: Spark Issue Type: Question Components: Documentation, Examples, Input/Output, Java API, Spark Core Affects Versions: 0.8.1 Reporter: Abhishek Tripathi Labels: performance Hi, I'm trying to push data coming from Spark Streaming to a Shark cached table. I thought of using the JDBC API, but Shark (0.81) does not support a direct insert statement, i.e. insert into emp values(2, Apia). I don't want to store the Spark Streaming output in HDFS and then copy that data to a Shark table. Can somebody please help: 1. How can I directly point Spark Streaming data to a Shark table/cached table? Or the other way around, how can Shark pick up data directly from Spark Streaming? 2. Does Shark 0.81 have a direct insert statement that does not refer to another table? This is really stopping us from using Spark further. We need your assistance urgently. Thanks in advance. Abhishek -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1381) Spark to Shark direct streaming
[ https://issues.apache.org/jira/browse/SPARK-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1381. -- Resolution: Won't Fix Spark to Shark direct streaming --- Key: SPARK-1381 URL: https://issues.apache.org/jira/browse/SPARK-1381 Project: Spark Issue Type: Question Components: Documentation, Examples, Input/Output, Java API, Spark Core Affects Versions: 0.8.1 Reporter: Abhishek Tripathi Labels: performance Hi, I'm trying to push data coming from Spark Streaming to a Shark cached table. I thought of using the JDBC API, but Shark (0.81) does not support a direct insert statement, i.e. insert into emp values(2, Apia). I don't want to store the Spark Streaming output in HDFS and then copy that data to a Shark table. Can somebody please help: 1. How can I directly point Spark Streaming data to a Shark table/cached table? Or the other way around, how can Shark pick up data directly from Spark Streaming? 2. Does Shark 0.81 have a direct insert statement that does not refer to another table? This is really stopping us from using Spark further. We need your assistance urgently. Thanks in advance. Abhishek -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1313) Shark- JDBC driver
[ https://issues.apache.org/jira/browse/SPARK-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1313: - Priority: Minor (was: Blocker) Issue Type: Question (was: Task) Shark- JDBC driver --- Key: SPARK-1313 URL: https://issues.apache.org/jira/browse/SPARK-1313 Project: Spark Issue Type: Question Components: Documentation, Examples, Java API Reporter: Abhishek Tripathi Priority: Minor Labels: Hive,JDBC, Shark, Hi, I'm trying to find a JDBC (or any other) driver that can connect to Shark from Java and execute Shark/Hive queries. Can you please advise if such a connector/driver is available? Thanks Abhishek -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1313) Shark- JDBC driver
[ https://issues.apache.org/jira/browse/SPARK-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1313. -- Resolution: Not a Problem This looks like it was a question more than anything, and was answered. Shark- JDBC driver --- Key: SPARK-1313 URL: https://issues.apache.org/jira/browse/SPARK-1313 Project: Spark Issue Type: Question Components: Documentation, Examples, Java API Reporter: Abhishek Tripathi Priority: Minor Labels: Hive,JDBC, Shark, Hi, I'm trying to find a JDBC (or any other) driver that can connect to Shark from Java and execute Shark/Hive queries. Can you please advise if such a connector/driver is available? Thanks Abhishek -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1884) Shark failed to start
[ https://issues.apache.org/jira/browse/SPARK-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1884. -- Resolution: Won't Fix This appears to be a protobuf version mismatch, which suggests Shark is being used with an unsupported version of Hadoop. As Shark is deprecated and unlikely to take steps to support anything else -- and because there is a sort of clear path to workaround here if one cared to -- I think this is a WontFix too? Shark failed to start - Key: SPARK-1884 URL: https://issues.apache.org/jira/browse/SPARK-1884 Project: Spark Issue Type: Bug Affects Versions: 0.9.1 Environment: ubuntu 14.04, spark 0.9.1, hive 0.13.0, hadoop 2.4.0 (stand alone), scala 2.11.0 Reporter: Wei Cui Priority: Blocker the hadoop, hive, spark works fine. when start the shark, it failed with the following messages: Starting the Shark Command Line Client 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 14/05/19 16:47:22 WARN conf.Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 14/05/19 16:47:22 WARN conf.Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to override final parameter: mapreduce.cluster.local.dir; Ignoring. 14/05/19 16:47:22 WARN conf.Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 14/05/19 16:47:22 WARN conf.Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to override final parameter: mapreduce.cluster.temp.dir; Ignoring. Logging initialized using configuration in jar:file:/usr/local/shark/lib_managed/jars/edu.berkeley.cs.shark/hive-common/hive-common-0.11.0-shark-0.9.1.jar!/hive-log4j.properties Hive history file=/tmp/root/hive_job_log_root_14857@ubuntu_201405191647_897494215.txt 6.004: [GC 279616K-18440K(1013632K), 0.0438980 secs] 6.445: [Full GC 59125K-7949K(1013632K), 0.0685160 secs] Reloading cached RDDs from previous Shark sessions... 
(use -skipRddReload flag to skip reloading) 7.535: [Full GC 104136K-13059K(1013632K), 0.0885820 secs] 8.459: [Full GC 61237K-18031K(1013632K), 0.0820400 secs] 8.662: [Full GC 29832K-8958K(1013632K), 0.0869700 secs] 8.751: [Full GC 13433K-8998K(1013632K), 0.0856520 secs] 10.435: [Full GC 72246K-14140K(1013632K), 0.1797530 secs] Exception in thread main org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1072) at shark.memstore2.TableRecovery$.reloadRdds(TableRecovery.scala:49) at shark.SharkCliDriver.init(SharkCliDriver.scala:283) at shark.SharkCliDriver$.main(SharkCliDriver.scala:162) at shark.SharkCliDriver.main(SharkCliDriver.scala) Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1139) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.init(RetryingMetaStoreClient.java:51) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:61) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2288) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2299) at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1070) ... 4 more Caused by:
[jira] [Commented] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152050#comment-14152050 ] Sean Owen commented on SPARK-3725: -- Yes of course, it's already in the repo and has been for a while. It was just renamed with a redirect from the old URL. But, that update hasn't hit the public site yet. Link to building spark returns a 404 Key: SPARK-3725 URL: https://issues.apache.org/jira/browse/SPARK-3725 Project: Spark Issue Type: Documentation Reporter: Anant Daksh Asthana Priority: Minor Original Estimate: 1m Remaining Estimate: 1m The README.md link to Building Spark returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152138#comment-14152138 ] Sean Owen commented on SPARK-3725: -- No, that links to the raw markdown. Truly, the fix is to rebuild the site. The source is fine. Link to building spark returns a 404 Key: SPARK-3725 URL: https://issues.apache.org/jira/browse/SPARK-3725 Project: Spark Issue Type: Documentation Reporter: Anant Daksh Asthana Priority: Minor Original Estimate: 1m Remaining Estimate: 1m The README.md link to Building Spark returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3730) Any one else having building spark recently
[ https://issues.apache.org/jira/browse/SPARK-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152140#comment-14152140 ] Sean Owen commented on SPARK-3730: -- (The profile is hadoop-2.3 but that's not the issue.) I have seen this too and it's a {{scalac}} bug as far as I can tell, as you can see from the stack trace. It's not a Spark issue. Any one else having building spark recently --- Key: SPARK-3730 URL: https://issues.apache.org/jira/browse/SPARK-3730 Project: Spark Issue Type: Question Reporter: Anant Daksh Asthana Priority: Minor I get an assertion error in spark/core/src/main/scala/org/apache/spark/HttpServer.scala while trying to build. I am building using mvn -Pyarn -PHadoop-2.3 -DskipTests -Phive clean package Here is the error i get http://pastebin.com/Shi43r53 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152912#comment-14152912 ] Sean Owen commented on SPARK-3732: -- FWIW, I was also surprised in the past that there is no way to submit a job programmatically. That would be great for embedding Spark. An option seems like overkill here. System.exit() is not a great idea in general and I agree that removing it is better. I can confirm that the JVM exit status is 0 on success and 1 on exception anyway, so this doesn't even change semantics. That is, you still get exit status 1 if an exception is thrown. The stack trace is also printed, so printing the exception is also redundant and the try block can go. Yarn Client: Add option to NOT System.exit() at end of main() - Key: SPARK-3732 URL: https://issues.apache.org/jira/browse/SPARK-3732 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Sotos Matzanas Original Estimate: 1h Remaining Estimate: 1h We would like to add the ability to create and submit Spark jobs programmatically via Scala/Java. We have found a way to hack this and submit jobs via Yarn, but since org.apache.spark.deploy.yarn.Client.main() exits with either 0 or 1 in the end, this would mean the exit of our own program as well. We would like to add an optional Spark conf param to NOT exit at the end of main(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156311#comment-14156311 ] Sean Owen commented on SPARK-3764: -- This is correct and as intended. Without any additional flags, yes, the version of Hadoop referenced by Spark would be 1.0.4. You should not rely on this though. If your app uses Spark but not Hadoop, it's not relevant as you are not packaging Spark or Hadoop dependencies in your app. If you use Spark and Hadoop APIs, you need to explicitly depend on the version of Hadoop you use on your cluster (but still not bundle with your app). Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the depending version of hadoop in {{pom.xml}} remains 1.0.4, so the hadoop version mismatch is happend. FYI: sbt seems to publish 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
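In sbt terms, the advice above amounts to something like the following -- an illustrative sketch, with versions standing in for whatever your cluster actually runs:
{code}
libraryDependencies ++= Seq(
  // Spark itself is provided by the cluster; do not bundle it in your application jar.
  "org.apache.spark"  %% "spark-core"    % "1.1.0" % "provided",
  // If you also use Hadoop APIs directly, depend explicitly on your cluster's Hadoop
  // version rather than relying on Spark's transitive 1.0.4 default.
  "org.apache.hadoop" %  "hadoop-client" % "2.4.0" % "provided"
)
{code}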
[jira] [Commented] (SPARK-2809) update chill to version 0.5.0
[ https://issues.apache.org/jira/browse/SPARK-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156362#comment-14156362 ] Sean Owen commented on SPARK-2809: -- PS chill 0.5.0 is the first to support Scala 2.11, so now this is actionable. http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22chill_2.11%22 update chill to version 0.5.0 - Key: SPARK-2809 URL: https://issues.apache.org/jira/browse/SPARK-2809 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati Assignee: Guoqiang Li First twitter chill_2.11 0.4 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
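For reference, the coordinates of the artifact linked above, expressed as a hypothetical sbt dependency line (the actual Spark build manages this through its own Maven/sbt configuration; with scalaVersion 2.11 this resolves chill_2.11):
{code}
libraryDependencies += "com.twitter" %% "chill" % "0.5.0"
{code}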
[jira] [Commented] (SPARK-1834) NoSuchMethodError when invoking JavaPairRDD.reduce() in Java
[ https://issues.apache.org/jira/browse/SPARK-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156477#comment-14156477 ] Sean Owen commented on SPARK-1834: -- Weird, I can reproduce this. I have a new test case for {{JavaAPISuite}} and am investigating. It compiles fine but fails at runtime. I sense Scala shenanigans. NoSuchMethodError when invoking JavaPairRDD.reduce() in Java Key: SPARK-1834 URL: https://issues.apache.org/jira/browse/SPARK-1834 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1 Environment: Redhat Linux, Java 7, Hadoop 2.2, Scala 2.10.4 Reporter: John Snodgrass I get a java.lang.NoSuchMethod error when invoking JavaPairRDD.reduce(). Here is the partial stack trace: Exception in thread "main" java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:39) at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2; at JavaPairRDDReduceTest.main(JavaPairRDDReduceTest.java:49)... I'm using Spark 0.9.1. I checked to ensure that I'm compiling with the same version of Spark as I am running on the cluster. The reduce() method works fine with JavaRDD, just not with JavaPairRDD. Here is a code snippet that exhibits the problem: ArrayList<Integer> array = new ArrayList(); for (int i = 0; i < 10; ++i) { array.add(i); } JavaRDD<Integer> rdd = javaSparkContext.parallelize(array); JavaPairRDD<String, Integer> testRDD = rdd.map(new PairFunction<Integer, String, Integer>() { @Override public Tuple2<String, Integer> call(Integer t) throws Exception { return new Tuple2("" + t, t); } }).cache(); testRDD.reduce(new Function2<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() { @Override public Tuple2<String, Integer> call(Tuple2<String, Integer> arg0, Tuple2<String, Integer> arg1) throws Exception { return new Tuple2(arg0._1 + arg1._1, arg0._2 * 10 + arg0._2); } }); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1834) NoSuchMethodError when invoking JavaPairRDD.reduce() in Java
[ https://issues.apache.org/jira/browse/SPARK-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156477#comment-14156477 ] Sean Owen edited comment on SPARK-1834 at 10/2/14 12:46 PM: Weird, I can reproduce this. It compiles fine but fails at runtime. Another example that doesn't even use lambdas:
{code}
@Test
public void pairReduce() {
  JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 1, 2, 3, 5, 8, 13));
  JavaPairRDD<Integer, Integer> pairRDD = rdd.mapToPair(
    new PairFunction<Integer, Integer, Integer>() {
      @Override
      public Tuple2<Integer, Integer> call(Integer i) {
        return new Tuple2<Integer, Integer>(i, i + 1);
      }
    });
  // See SPARK-1834
  Tuple2<Integer, Integer> reduced = pairRDD.reduce(
    new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
      @Override
      public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> t1, Tuple2<Integer, Integer> t2) {
        return new Tuple2<Integer, Integer>(t1._1() + t2._1(), t1._2() + t2._2());
      }
    });
  Assert.assertEquals(33, reduced._1().intValue());
  Assert.assertEquals(40, reduced._2().intValue());
}
{code}
but...
{code}
java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2;
{code}
I decompiled the class and it really looks like the method is there with the expected signature:
{code}
public scala.Tuple2<K, V> reduce(org.apache.spark.api.java.function.Function2<scala.Tuple2<K, V>, scala.Tuple2<K, V>, scala.Tuple2<K, V>>);
{code}
Color me pretty confused. was (Author: srowen): Weird, I can reproduce this. I have a new test case for {{JavaAPISuite}} and am investigating. It compiles fine but fails at runtime. I sense Scala shenanigans. NoSuchMethodError when invoking JavaPairRDD.reduce() in Java Key: SPARK-1834 URL: https://issues.apache.org/jira/browse/SPARK-1834 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1 Environment: Redhat Linux, Java 7, Hadoop 2.2, Scala 2.10.4 Reporter: John Snodgrass I get a java.lang.NoSuchMethod error when invoking JavaPairRDD.reduce(). Here is the partial stack trace: Exception in thread "main" java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:39) at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2; at JavaPairRDDReduceTest.main(JavaPairRDDReduceTest.java:49)... I'm using Spark 0.9.1. I checked to ensure that I'm compiling with the same version of Spark as I am running on the cluster. The reduce() method works fine with JavaRDD, just not with JavaPairRDD. 
Here is a code snippet that exhibits the problem: ArrayList<Integer> array = new ArrayList(); for (int i = 0; i < 10; ++i) { array.add(i); } JavaRDD<Integer> rdd = javaSparkContext.parallelize(array); JavaPairRDD<String, Integer> testRDD = rdd.map(new PairFunction<Integer, String, Integer>() { @Override public Tuple2<String, Integer> call(Integer t) throws Exception { return new Tuple2("" + t, t); } }).cache(); testRDD.reduce(new Function2<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() { @Override public Tuple2<String, Integer> call(Tuple2<String, Integer> arg0, Tuple2<String, Integer> arg1) throws Exception { return new Tuple2(arg0._1 + arg1._1, arg0._2 * 10 + arg0._2); } }); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156581#comment-14156581 ] Sean Owen commented on SPARK-3764: -- The artifacts themselves don't contain any Hadoop code. The default disposition of the pom would link to Hadoop 1, but apps are not meant to depend on this (this is generally good Maven practice). Yes, you always need to add Hadoop dependencies if you use Hadoop APIs. That's not specific to Spark. In fact, you will want to mark Spark and Hadoop as provided dependencies when making an app for use with spark-submit. You can use the Spark artifacts to build a Spark app that works with Hadoop 2 or Hadoop 1. The instructions you see are really about creating a build of Spark itself to deploy on a cluster, rather than an app for Spark. Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the depending version of hadoop in {{pom.xml}} remains 1.0.4, so the hadoop version mismatch is happend. FYI: sbt seems to publish 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
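To make the "provided" advice above concrete, here is a hypothetical build.sbt fragment for an app that will be launched with spark-submit. The versions are placeholders only and should match whatever Spark and Hadoop are actually deployed on the cluster.
{code}
// Hypothetical sketch; artifact versions are examples only
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"    % "1.1.0" % "provided",
  "org.apache.hadoop" % "hadoop-client" % "2.4.0" % "provided"
)
{code}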
[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156723#comment-14156723 ] Sean Owen commented on SPARK-3764: -- I'm not sure what you mean. Spark compiles versus most versions of Hadoop 1 and 2. You can see the profiles in the build that help support this. These are however not relevant to someone that is just building a Spark app. Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the depending version of hadoop in {{pom.xml}} remains 1.0.4, so the hadoop version mismatch is happend. FYI: sbt seems to publish 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path
[ https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157156#comment-14157156 ] Sean Owen commented on SPARK-3769: -- My understanding is that you execute:
{code}
sc.addFile("/opt/tom/SparkFiles.sas");
...
SparkFiles.get("SparkFiles.sas");
{code}
I would not expect the key used by remote workers to encode the location on the driver that the file came from. The path may not be absolute in all cases anyway. I can see the argument that it feels like both should be the same key, but really the key being set is the file name, not the path. You don't have to parse it by hand though. Usually you might do something like this anyway:
{code}
File myFile = new File(args[1]);
sc.addFile(myFile.getAbsolutePath());
String fileName = myFile.getName();
...
SparkFiles.get(fileName);
{code}
AFAIK this is as intended. SparkFiles.get gives me the wrong fully qualified path -- Key: SPARK-3769 URL: https://issues.apache.org/jira/browse/SPARK-3769 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0 Environment: linux host, and linux grid. Reporter: Tom Weber Priority: Minor My spark pgm running on my host, (submitting work to my grid). JavaSparkContext sc = new JavaSparkContext(conf); final String path = args[1]; sc.addFile(path); /* args[1] = /opt/tom/SparkFiles.sas */ The log shows: 14/10/02 16:07:14 INFO Utils: Copying /opt/tom/SparkFiles.sas to /tmp/spark-4c661c3f-cb57-4c9f-a0e9-c2162a89db77/SparkFiles.sas 14/10/02 16:07:15 INFO SparkContext: Added file /opt/tom/SparkFiles.sas at http://10.20.xx.xx:49587/files/SparkFiles.sas with timestamp 1412280434986 those are paths on my host machine. The location that this file gets on grid nodes is: /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/SparkFiles.sas While the call to get the path in my code that runs in my mapPartitions function on the grid nodes is: String pgm = SparkFiles.get(path); And this returns the following string: /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/./opt/tom/SparkFiles.sas So, am I expected to take the qualified path that was given to me and parse it to get only the file name at the end, and then concatenate that to what I get from the SparkFiles.getRootDirectory() call in order to get this to work? Or pass only the parsed file name to the SparkFiles.get method? Seems as though I should be able to pass the same file specification to both sc.addFile() and SparkFiles.get() and get the correct location of the file. Thanks, Tom -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3764. -- Resolution: Not a Problem Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the depending version of hadoop in {{pom.xml}} remains 1.0.4, so the hadoop version mismatch is happend. FYI: sbt seems to publish 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3794) Building spark core fails with specific hadoop version
[ https://issues.apache.org/jira/browse/SPARK-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159524#comment-14159524 ] Sean Owen commented on SPARK-3794: -- The real problem here is that commons-io is not a dependency of Spark, and should not be, but it began to be used a few days ago in commit https://github.com/apache/spark/commit/cf1d32e3e1071829b152d4b597bf0a0d7a5629a2 for SPARK-1860. So it is accidentally depending on the version of Commons IO brought in by third party dependencies. I will propose a PR that removes this usage in favor of Guava or Java APIs. Building spark core fails with specific hadoop version -- Key: SPARK-3794 URL: https://issues.apache.org/jira/browse/SPARK-3794 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5 Reporter: cocoatomo Labels: spark At the commit cf1d32e3e1071829b152d4b597bf0a0d7a5629a2, building spark core result in compilation error when we specify some hadoop versions. To reproduce this issue, we should execute following command with hadoop.version=1.1.0, 1.1.1, 1.1.2, 1.2.0, 1.2.1, or 2.2.0. {noformat} $ cd ./core $ mvn -Dhadoop.version=hadoop.version -DskipTests clean compile ... [ERROR] /Users/tomohiko/MyRepos/Scala/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:720: value listFilesAndDirs is not a member of object org.apache.commons.io.FileUtils [ERROR] val files = FileUtils.listFilesAndDirs(dir, TrueFileFilter.TRUE, TrueFileFilter.TRUE) [ERROR] ^ {noformat} Because that compilation uses commons-io version 2.1 and FileUtils#listFilesAndDirs method was added at commons-io version 2.2, this compilation always fails. FileUtils#listFilesAndDirs → http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/FileUtils.html#listFilesAndDirs%28java.io.File,%20org.apache.commons.io.filefilter.IOFileFilter,%20org.apache.commons.io.filefilter.IOFileFilter%29 Because a hadoop-client in those problematic version depends on commons-io 2.1 not 2.4, we should have assumption that commons-io is version 2.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
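The kind of replacement suggested above could look roughly like the following sketch, which walks a directory tree using only java.io. This is illustrative only, assuming a simple recursive helper, and is not the actual patch proposed for SPARK-3794.
{code}
import java.io.File

// List a directory, all subdirectories, and all files underneath it,
// without relying on commons-io's FileUtils.listFilesAndDirs (2.2+ only).
def listFilesAndDirs(dir: File): Seq[File] = {
  val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
  dir +: children.flatMap { f =>
    if (f.isDirectory) listFilesAndDirs(f) else Seq(f)
  }
}
{code}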
[jira] [Commented] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159573#comment-14159573 ] Sean Owen commented on SPARK-3561: -- I'd be interested to see a more specific motivating use case. Is this about using Tez, for example? Where does it help to stack Spark on Tez on YARN, or MR2, etc.? Spark Core and Tez overlap, to be sure, and I'm not sure how much value it adds to run one on the other. Kind of like running Oracle on MySQL or something. Whatever it is: might it not be more natural to integrate the feature into Spark itself? It would be great if this were all just a matter of one extra trait and interface. In practice I suspect there are a number of hidden assumptions throughout the code that may leak through attempts at this abstraction. I am definitely asking rather than asserting, and am curious to see more specifics about the upside. Expose pluggable architecture to facilitate native integration with third-party execution environments. --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark _integrates with external resource-managing platforms_ such as Apache Hadoop YARN and Mesos to facilitate execution of the Spark DAG in a distributed environment provided by those platforms. However, this integration is tightly coupled within Spark's implementation, making it rather difficult to introduce integration points with other resource-managing platforms without constant modifications to Spark's core (see comments below for more details). In addition, Spark _does not provide any integration points to third-party **DAG-like** and **DAG-capable** execution environments_ native to those platforms, thus limiting access to some of their native features (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management and monitoring and more) as well as specialization aspects of such execution environments (open source and proprietary). As an example, the inability to gain access to such features is starting to affect Spark's viability in large-scale, batch and/or ETL applications. Introducing a pluggable architecture would solve both of the issues mentioned above, ultimately benefitting Spark's technology and community by allowing it to venture into co-existence and collaboration with a variety of existing Big Data platforms as well as the ones yet to come to the market. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - as a non-public api (@DeveloperAPI). The trait will define only 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via a master URL such as _execution-context:foo.bar.MyJobExecutionContext_, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to such an implementation and ensuring binary and source compatibility with older versions of Spark. An integrator will now have an option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc and pull request for more details. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
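For readers skimming the proposal above, a rough Scala sketch of a trait with the four named operations might look like the following. The signatures are simplified and illustrative only; they are not taken from the attached design doc or pull request.
{code}
import scala.reflect.ClassTag
import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Illustrative only: parameter lists are reduced to the essentials
trait JobExecutionContext {
  def hadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def newAPIHadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
  def runJob[T, U: ClassTag](
      sc: SparkContext,
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U): Array[U]
}
{code}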
[jira] [Commented] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents
[ https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159587#comment-14159587 ] Sean Owen commented on SPARK-3803: -- I agree with your assessment. It would take some work, though not terribly much, to rewrite this method to correctly handle A with more than 46340 columns. At n = 46340, the Gramian already consumes about 8.5GB of memory, so it's kinda getting big to realistically use in core anyway. At the least, an error should be raised if n is too large. Any one else think this should be supported though? Would be nice, but, practically helpful? ArrayIndexOutOfBoundsException found in executing computePrincipalComponents Key: SPARK-3803 URL: https://issues.apache.org/jira/browse/SPARK-3803 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Masaru Dobashi When I executed computePrincipalComponents method of RowMatrix, I got java.lang.ArrayIndexOutOfBoundsException. {code} 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161 org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) scala.collection.AbstractIterator.aggregate(Iterator.scala:1157) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} The RowMatrix instance was generated from the result of TF-IDF like 
the following.
{code}
scala> val hashingTF = new HashingTF()
scala> val tf = hashingTF.transform(texts)
scala> import org.apache.spark.mllib.feature.IDF
scala> tf.cache()
scala> val idf = new IDF().fit(tf)
scala> val tfidf: RDD[Vector] = idf.transform(tf)
scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val mat = new RowMatrix(tfidf)
scala> val pc = mat.computePrincipalComponents(2)
{code}
I think this was because I created HashingTF instance with default numFeatures and Array is used in RowMatrix#computeGramianMatrix method like the following.
{code}
/**
 * Computes the Gramian matrix `A^T A`.
 */
def computeGramianMatrix(): Matrix = {
  val n = numCols().toInt
  val nt: Int = n * (n + 1) / 2
  // Compute the upper triangular part of the gram matrix.
  val GU = rows.treeAggregate(new BDV[Double](new Array[Double](nt)))(
    seqOp = (U, v) =>
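To make the 46340-column limit mentioned in the comment above concrete, here is a small standalone arithmetic sketch (not MLlib code). The upper-triangle length {{n * (n + 1) / 2}} is computed with Int arithmetic, so the intermediate product {{n * (n + 1)}} overflows Int.MaxValue as soon as n reaches 46341, and with HashingTF's default 2^20 features the true length would not fit in any Int-indexed array at all.
{code}
object GramianSizeCheck {
  def main(args: Array[String]): Unit = {
    val ok  = 46340
    val bad = 46341
    println(ok.toLong * (ok + 1) / 2)    // 1073720970 entries (~8.5 GB of doubles)
    println(bad * (bad + 1) / 2)         // Int overflow: prints a negative number
    println(bad.toLong * (bad + 1) / 2)  // 1073767311, the intended value
    val n = 1 << 20                      // HashingTF default numFeatures
    println(n.toLong * (n + 1) / 2)      // 549756338176: far too large for any array
  }
}
{code}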
[jira] [Resolved] (SPARK-3784) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3784. -- Resolution: Duplicate Duplicate of SPARK-3785 Support off-loading computations to a GPU - Key: SPARK-3784 URL: https://issues.apache.org/jira/browse/SPARK-3784 Project: Spark Issue Type: Brainstorming Components: MLlib Reporter: Thomas Darimont Priority: Minor Are there any plans to adding support for off-loading computations to the GPU, e.g. via an open-cl binding? http://www.jocl.org/ https://code.google.com/p/javacl/ http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160139#comment-14160139 ] Sean Owen commented on SPARK-3785: -- In broad terms, I find there are few computations that a GPU can speed up, just because of the overhead of getting data from the JVM into the GPU and back. It makes sense for large computations where the compute cost is disproportionately large compared to the data (like maybe solving a big linear system). Such cases exist but are rare. Is there something specific to Spark here? You can use any JVM-based library you like to do what you like in a Spark program. Support off-loading computations to a GPU - Key: SPARK-3785 URL: https://issues.apache.org/jira/browse/SPARK-3785 Project: Spark Issue Type: Brainstorming Components: MLlib Reporter: Thomas Darimont Priority: Minor Are there any plans to adding support for off-loading computations to the GPU, e.g. via an open-cl binding? http://www.jocl.org/ https://code.google.com/p/javacl/ http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org