[jira] [Created] (SPARK-3811) More robust / standard Utils.deleteRecursively, Utils.createTempDir
Sean Owen created SPARK-3811: Summary: More robust / standard Utils.deleteRecursively, Utils.createTempDir Key: SPARK-3811 URL: https://issues.apache.org/jira/browse/SPARK-3811 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sean Owen Priority: Minor I noticed a few issues with how temp directories are created and deleted: *Minor* * Guava's {{Files.createTempDir()}} plus {{File.deleteOnExit()}} is used in many tests to make a temp dir, but {{Utils.createTempDir()}} seems to be the standard Spark mechanism * The call to {{File.deleteOnExit()}} could be pushed into {{Utils.createTempDir()}} as well, as part of that replacement. * _I messed up the message in an exception in {{Utils}} in SPARK-3794; fixed here_ *Bit Less Minor* * {{Utils.deleteRecursively()}} fails immediately if any {{IOException}} occurs, instead of trying to delete any remaining files and subdirectories. I've observed this leave temp dirs around. I suggest changing it to continue in the face of an exception and, at the end, throw one of the possibly several exceptions that occurred. * {{Utils.createTempDir()}} adds a JVM shutdown hook every time the method is called, even when the new dir is inside a dir that is already registered, because that check only happens inside the hook. However, {{Utils}} already manages a set of all dirs to delete on shutdown, called {{shutdownDeletePaths}}. A single hook can be registered to delete all of these on exit. This is how Tachyon temp paths are cleaned up in {{TachyonBlockManager}}. I noticed a few other things that might be changed but wanted to ask first: * Shouldn't the set of dirs to delete be {{File}}, not just {{String}} paths? * {{Utils}} manages the set of {{TachyonFile}} that have been registered for deletion, but the shutdown hook is managed in {{TachyonBlockManager}}. Should this logic not live together, and outside {{Utils}}? It's more specific to Tachyon, and looks a bit odd to import in such a generic place. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
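To make the "continue past an IOException" idea concrete, here is a rough Scala sketch under my own assumptions (this is not the current {{Utils}} code): record the first failure, keep deleting the remaining children, and rethrow at the end. A single shutdown hook over {{shutdownDeletePaths}} could then simply call this once per registered dir.

{code}
import java.io.{File, IOException}

// Sketch only: delete children even when one of them fails, and surface
// one of the collected failures once the whole tree has been attempted.
def deleteRecursively(dir: File): Unit = {
  var firstError: Option[IOException] = None
  if (dir.isDirectory) {
    Option(dir.listFiles()).getOrElse(Array.empty[File]).foreach { child =>
      try {
        deleteRecursively(child)
      } catch {
        case e: IOException => if (firstError.isEmpty) firstError = Some(e)
      }
    }
  }
  if (!dir.delete() && dir.exists() && firstError.isEmpty) {
    firstError = Some(new IOException("Failed to delete: " + dir.getAbsolutePath))
  }
  firstError.foreach(e => throw e)
}
{code}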
[jira] [Commented] (SPARK-3828) Spark returns inconsistent results when building with different Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161552#comment-14161552 ] Sean Owen commented on SPARK-3828: -- (Agree, although there's an interesting point in here about Spark using the old TextInputFormat even on newer Hadoop. I kind of assume that will persist until Hadoop 1.x support is dropped, rather than bother to use reflection to use the newer class.) Spark returns inconsistent results when building with different Hadoop version --- Key: SPARK-3828 URL: https://issues.apache.org/jira/browse/SPARK-3828 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: OSX 10.9, Spark master branch Reporter: Liquan Pei For the text8 data at http://mattmahoney.net/dc/text8.zip (to reproduce, please unzip it first), Spark built with different Hadoop versions returns different results. {code} val data = sc.textFile(text8) data.count() {code} returns 1 when built with SPARK_HADOOP_VERSION=1.0.4 and returns 2 when built with SPARK_HADOOP_VERSION=2.4.0. Looking through the RDD code, it seems that textFile uses hadoopFile, which creates a HadoopRDD; we should probably create a newHadoopRDD when building Spark with SPARK_HADOOP_VERSION = 2.0.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
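As a point of comparison (a sketch, not a proposed patch): from a spark-shell on a Hadoop 2.x build, the same file can be read through the new-API input format without any change to Spark, which would show whether the old vs. new TextInputFormat is really the source of the difference. {{sc}} and a local text8 path are assumed.

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Read text8 through the new (mapreduce) API rather than the old (mapred) one
// that sc.textFile/hadoopFile uses, then count records the same way.
val data = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("text8")
  .map(_._2.toString)
println(data.count())
{code}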
[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163125#comment-14163125 ] Sean Owen commented on SPARK-3785: -- Sure, but a GPU isn't going to be good at general map, filter, reduce, groupBy operations. It can't run arbitrary functions like the JVM can. I wonder how many use cases actually consist of enough computation that can be specialized for the GPU, chained together, to make the GPU worth it. My suspicion is still that there are really only a few wins for this use case, and that they are achievable by just calling the GPU from Java code. I'd love to see that this is in fact a way to transparently speed up a non-trivial slice of mainstream Spark use cases though. Support off-loading computations to a GPU - Key: SPARK-3785 URL: https://issues.apache.org/jira/browse/SPARK-3785 Project: Spark Issue Type: Brainstorming Components: MLlib Reporter: Thomas Darimont Priority: Minor Are there any plans to add support for off-loading computations to the GPU, e.g. via an OpenCL binding? http://www.jocl.org/ https://code.google.com/p/javacl/ http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
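To illustrate the "call the GPU from your own code" route mentioned above, a hedged Scala sketch: {{gpuBatchDot}} stands in for a hypothetical wrapper around an OpenCL/JOCL kernel (it is not a real library call), and {{vectorsRDD}} / {{weightsBroadcast}} are assumed to be an RDD of double arrays and a broadcast weight vector. Only the {{mapPartitions}} plumbing is ordinary Spark.

{code}
// Hypothetical native wrapper around an OpenCL kernel -- placeholder, not a real API.
def gpuBatchDot(vectors: Array[Array[Double]], weights: Array[Double]): Array[Double] = ???

// Hand the GPU a whole partition at a time so per-launch and transfer overhead
// is amortized over many records, then return the scores as an iterator.
val scores = vectorsRDD.mapPartitions { iter =>
  val batch = iter.toArray
  gpuBatchDot(batch, weightsBroadcast.value).iterator
}
{code}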
[jira] [Commented] (SPARK-3895) Scala style: Indentation of method
[ https://issues.apache.org/jira/browse/SPARK-3895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166415#comment-14166415 ] Sean Owen commented on SPARK-3895: -- To be clear, the second is correct because of line length and brace placement. However the PR you mention shows the opposite, changing the second into the first. The style guide already covers line length and braces. If the net change is moving braces, I am not sure that's worth doing for its own sake, given it is completely non-functional. (Although it can be fixed up when nearby code is edited.) So is there any action that falls out from this JIRA? Scala style: Indentation of method -- Key: SPARK-3895 URL: https://issues.apache.org/jira/browse/SPARK-3895 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: sjk Such as https://github.com/apache/spark/pull/2734 {code:title=core/src/main/scala/org/apache/spark/Aggregator.scala|borderStyle=solid} // for example def combineCombinersByKey(iter: Iterator[_ <: Product2[K, C]], context: TaskContext) : Iterator[(K, C)] = { ... def combineValuesByKey(iter: Iterator[_ <: Product2[K, V]], context: TaskContext): Iterator[(K, C)] = { {code} These do not conform to the rule: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide There is a lot of code like this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
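For illustration, here is how I read the two forms once the stripped characters are restored, in a self-contained sketch (the trait and its name are scaffolding for the example, not the real {{Aggregator}}):

{code}
import org.apache.spark.TaskContext

// Illustrative only -- not the actual Spark class.
trait ExampleAggregator[K, V, C] {
  // When the whole signature fits on one line, keep it on one line
  // (the form the comment above calls correct).
  def combineValuesByKey(iter: Iterator[_ <: Product2[K, V]], context: TaskContext): Iterator[(K, C)]

  // When it does not fit, as I read the wiki style guide, wrap the parameter list
  // with 4-space indentation rather than pushing only the return type onto a new line.
  def combineCombinersByKey(
      iter: Iterator[_ <: Product2[K, C]],
      context: TaskContext): Iterator[(K, C)]
}
{code}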
[jira] [Commented] (SPARK-3894) Scala style: line length increase to 160
[ https://issues.apache.org/jira/browse/SPARK-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166417#comment-14166417 ] Sean Owen commented on SPARK-3894: -- This is an ancient religious war. At Google, it was the 80s vs the 120s, and eventually the 120s won after a long bitter battle (an email thread really). I have not heard anyone argue for 160, and the best practical argument against it is that it makes it hard even on largeish screens to have two editors side by side. Even 120 does sometimes. Even if the standard became 160, the entire code base is wrapped at 100, and we'd have code that is mostly 100 characters wide with a bit at 160. I personally think 100 is just fine. Scala style: line length increase to 160 Key: SPARK-3894 URL: https://issues.apache.org/jira/browse/SPARK-3894 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: sjk 100 is too short; our screens are bigger. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3445) Deprecate and later remove YARN alpha support
[ https://issues.apache.org/jira/browse/SPARK-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168577#comment-14168577 ] Sean Owen commented on SPARK-3445: -- YARN alpha is not deprecated or removed yet; this JIRA is not resolved. Even if it were deprecated it should still compile and work. This is an error introduced by a recent change. [~andrewor14] [~andrewor] can you have a look? this line looks like it was introduced in https://github.com/apache/spark/commit/c4022dd52b4827323ff956632dc7623f546da937 / SPARK-3477 Deprecate and later remove YARN alpha support - Key: SPARK-3445 URL: https://issues.apache.org/jira/browse/SPARK-3445 Project: Spark Issue Type: Improvement Components: YARN Reporter: Patrick Wendell This will depend a bit on both user demand and the commitment level of maintainers, but I'd like to propose the following timeline for yarn-alpha support. Spark 1.2: Deprecate YARN-alpha Spark 1.3: Remove YARN-alpha (i.e. require YARN-stable) Since YARN-alpha is clearly identified as an alpha API, it seems reasonable to drop support for it in a minor release. However, it does depend a bit whether anyone uses this outside of Yahoo!, and that I'm not sure of. In the past this API has been used and maintained by Yahoo, but they'll be migrating soon to the stable API's. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3926) result of JavaRDD collectAsMap() is not serializable
[ https://issues.apache.org/jira/browse/SPARK-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169383#comment-14169383 ] Sean Owen commented on SPARK-3926: -- Yeah, seems fine to just let {{MapWrapper}} implement {{Serializable}}, because standard Java {{Map}} implementations are as well. It's backwards-compatible so seems like an easy PR to submit if you like. result of JavaRDD collectAsMap() is not serializable Key: SPARK-3926 URL: https://issues.apache.org/jira/browse/SPARK-3926 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.1.0 Environment: CentOS / Spark 1.1 / Hadoop Hortonworks 2.4.0.2.1.2.0-402 Reporter: Antoine Amend Using the Java API, I want to collect the result of a RDDString, String as a HashMap using collectAsMap function: MapString, String map = myJavaRDD.collectAsMap(); This works fine, but when passing this map to another function, such as... myOtherJavaRDD.mapToPair(new CustomFunction(map)) ...this leads to the following error: Exception in thread main org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.map(RDD.scala:270) at org.apache.spark.api.java.JavaRDDLike$class.mapToPair(JavaRDDLike.scala:99) at org.apache.spark.api.java.JavaPairRDD.mapToPair(JavaPairRDD.scala:44) ../.. MY CLASS ../.. at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.NotSerializableException: scala.collection.convert.Wrappers$MapWrapper at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) This seems to be due to WrapAsJava.scala being non serializable ../.. implicit def mapAsJavaMap[A, B](m: Map[A, B]): ju.Map[A, B] = m match { //case JConcurrentMapWrapper(wrapped) = wrapped case JMapWrapper(wrapped) = wrapped.asInstanceOf[ju.Map[A, B]] case _ = new MapWrapper(m) } ../.. 
The workaround is to manually wrap this map in another one (which is serializable): Map<String, String> map = myJavaRDD.collectAsMap(); Map<String, String> tmp = new HashMap<String, String>(map); myOtherJavaRDD.mapToPair(new CustomFunction(tmp)) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3897) Scala style: format example code
[ https://issues.apache.org/jira/browse/SPARK-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3897. -- Resolution: Won't Fix Given recent discussion, and consensus to not make sweeping style changes, I think this is WontFix. Scala style: format example code Key: SPARK-3897 URL: https://issues.apache.org/jira/browse/SPARK-3897 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: sjk https://github.com/apache/spark/pull/2754 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3895) Scala style: Indentation of method
[ https://issues.apache.org/jira/browse/SPARK-3895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3895. -- Resolution: Won't Fix Given recent discussion, and consensus to not make sweeping style changes, I think this is WontFix. Scala style: Indentation of method -- Key: SPARK-3895 URL: https://issues.apache.org/jira/browse/SPARK-3895 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: sjk such as https://github.com/apache/spark/pull/2734 {code:title=core/src/main/scala/org/apache/spark/Aggregator.scala|borderStyle=solid} // for example def combineCombinersByKey(iter: Iterator[_ : Product2[K, C]], context: TaskContext) : Iterator[(K, C)] = { ... def combineValuesByKey(iter: Iterator[_ : Product2[K, V]], context: TaskContext): Iterator[(K, C)] = { {code} there are not conform to the rule.https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide there are so much code like this -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3781) code style format
[ https://issues.apache.org/jira/browse/SPARK-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3781. -- Resolution: Won't Fix Given recent discussion, and consensus to not make sweeping style changes, I think this is WontFix. code style format - Key: SPARK-3781 URL: https://issues.apache.org/jira/browse/SPARK-3781 Project: Spark Issue Type: Improvement Reporter: sjk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3890) remove redundant spark.executor.memory in doc
[ https://issues.apache.org/jira/browse/SPARK-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169405#comment-14169405 ] Sean Owen commented on SPARK-3890: -- For some reason the PR was not linked: https://github.com/apache/spark/pull/2745 remove redundant spark.executor.memory in doc - Key: SPARK-3890 URL: https://issues.apache.org/jira/browse/SPARK-3890 Project: Spark Issue Type: Improvement Components: Documentation Reporter: WangTaoTheTonic Priority: Minor Seems like there is a redundant spark.executor.memory config item in docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3896) checkSpeculatableTasks fask quit loop, invoking checkSpeculatableTasks is expensive
[ https://issues.apache.org/jira/browse/SPARK-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169409#comment-14169409 ] Sean Owen commented on SPARK-3896: -- Oops, my bad. I just realized that some PRs didn't link after looking at other recent JIRAs. checkSpeculatableTasks fask quit loop, invoking checkSpeculatableTasks is expensive --- Key: SPARK-3896 URL: https://issues.apache.org/jira/browse/SPARK-3896 Project: Spark Issue Type: Improvement Reporter: sjk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3662) Importing pandas breaks included pi.py example
[ https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169424#comment-14169424 ] Sean Owen commented on SPARK-3662: -- [~esamanas] Do you have a suggested change here, beyond just disambiguating imports in your example? Or a different example that doesn't involve import collision? It sounds like the modified example is then misunderstood to refer to a pandas random class, not the Python one, and that is simply a matter of namespace collision, and why pandas is dragged in. This example seems to fall down before it demonstrates anything else. Importing pandas breaks included pi.py example -- Key: SPARK-3662 URL: https://issues.apache.org/jira/browse/SPARK-3662 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.1.0 Environment: Xubuntu 14.04. Yarn cluster running on Ubuntu 12.04. Reporter: Evan Samanas If I add import pandas at the top of the included pi.py example and submit using spark-submit --master yarn-client, I get this stack trace: {code} Traceback (most recent call last): File /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi.py, line 39, in module count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add) File /home/evan/pub_src/spark/python/pyspark/rdd.py, line 759, in reduce vals = self.mapPartitions(func).collect() File /home/evan/pub_src/spark/python/pyspark/rdd.py, line 723, in collect bytesInJava = self._jrdd.collect().iterator() File /home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError14/09/23 15:51:58 INFO TaskSetManager: Lost task 2.3 in stage 0.0 (TID 10) on executor SERVERNAMEREMOVED: org.apache.spark.api.python.PythonException (Traceback (most recent call last): File /yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py, line 75, in main command = pickleSer._read_with_length(infile) File /yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py, line 150, in _read_with_length return self.loads(obj) ImportError: No module named algos {code} The example works fine if I move the statement from random import random from the top and into the function (def f(_)) defined in the example. Near as I can tell, random is getting confused with a function of the same name within pandas.algos. Submitting the same script using --master local works, but gives a distressing amount of random characters to stdout or stderr and messes up my terminal: {code} ... 
@J@J@J@J@J@J@J@J@J@J@J@J@J@JJ@J@J@J@J @J!@J@J#@J$@J%@J@J'@J(@J)@J*@J+@J,@J-@J.@J/@J0@J1@J2@J3@J4@J5@J6@J7@J8@J9@J:@J;@J@J=@J@J?@J@@JA@JB@JC@JD@JE@JF@JG@JH@JI@JJ@JK@JL@JM@JN@JO@JP@JQ@JR@JS@JT@JU@JV@JW@JX@JY@JZ@J[@J\@J]@J^@J_@J`@Ja@Jb@Jc@Jd@Je@Jf@Jg@Jh@Ji@Jj@Jk@Jl@Jm@Jn@Jo@Jp@Jq@Jr@Js@Jt@Ju@Jv@Jw@Jx@Jy@Jz@J{@J|@J}@J~@J@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JJJ�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JAJAJAJAJAJAJAJAAJ AJ AJ AJ AJAJAJAJAJAJAJAJAJAJAJAJAJAJJAJAJAJAJ AJ!AJAJ#AJ$AJ%AJAJ'AJ(AJ)AJ*AJ+AJ,AJ-AJ.AJ/AJ0AJ1AJ2AJ3AJ4AJ5AJ6AJ7AJ8AJ9AJ:AJ;AJAJ=AJAJ?AJ@AJAAJBAJCAJDAJEAJFAJGAJHAJIAJJAJKAJLAJMAJNAJOAJPAJQAJRAJSAJTAJUAJVAJWAJXAJYAJZAJ[AJ\AJ]AJ^AJ_AJ`AJaAJbAJcAJdAJeAJfAJgAJhAJiAJjAJkAJlAJmAJnAJoAJpAJqAJrAJsAJtAJuAJvAJwAJxAJyAJzAJ{AJ|AJ}AJ~AJAJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJJJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�A14/09/23 15:42:09 INFO SparkContext: Job finished: reduce at /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi_sframe.py:38, took 11.276879779 s J�AJ�AJ�AJ�AJ�AJ�AJ�A�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJBJBJBJBJBJBJBJBBJ BJ BJ BJ BJBJBJBJBJBJBJBJBJBJBJBJBJBJJBJBJBJBJ BJ!BJBJ#BJ$BJ%BJBJ'BJ(BJ)BJ*BJ+BJ,BJ-BJ.BJ/BJ0BJ1BJ2BJ3BJ4BJ5BJ6BJ7BJ8BJ9BJ:BJ;BJBJ=BJBJ?BJ@Be. �]qJ#1a. �]qJX4a. �]qJX4a. �]qJ#1a. �]qJX4a. �]qJX4a. �]qJ#1a. �]qJX4a. �]qJX4a. �]qJa. Pi is roughly 3.146136 {code} No idea if that's related, but thought I'd include it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
[jira] [Resolved] (SPARK-3506) 1.1.0-SNAPSHOT in docs for 1.1.0 under docs/latest
[ https://issues.apache.org/jira/browse/SPARK-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3506. -- Resolution: Fixed Fix Version/s: 1.1.1 Looks like the site has been updated, and I see no SNAPSHOT on the page. 1.1.0-SNAPSHOT in docs for 1.1.0 under docs/latest -- Key: SPARK-3506 URL: https://issues.apache.org/jira/browse/SPARK-3506 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Jacek Laskowski Assignee: Patrick Wendell Priority: Trivial Fix For: 1.1.1 In https://spark.apache.org/docs/latest/ there are references to 1.1.0-SNAPSHOT: * This documentation is for Spark version 1.1.0-SNAPSHOT. * For the Scala API, Spark 1.1.0-SNAPSHOT uses Scala 2.10. It should be version 1.1.0 since that's the latest released version and the header tells so, too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3251) Clarify learning interfaces
[ https://issues.apache.org/jira/browse/SPARK-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169449#comment-14169449 ] Sean Owen commented on SPARK-3251: -- Is this a subset of / duplicate of SPARK-3702 now, given the discussion? Clarify learning interfaces Key: SPARK-3251 URL: https://issues.apache.org/jira/browse/SPARK-3251 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.1.0, 1.1.1 Reporter: Christoph Sawade *Make threshold mandatory* Currently, the output of predict for an example is either the score or the class. This side-effect is caused by clearThreshold. To clarify that behaviour, three different types of predict (predictScore, predictClass, predictProbability) were introduced; the threshold is no longer optional. *Clarify classification interfaces* Currently, some functionality is spread across multiple models. In order to clarify the structure and simplify the implementation of more complex models (like multinomial logistic regression), two new classes are introduced: - BinaryClassificationModel: for all models that derive a binary classification from a single weight vector. Comprises the thresholding functionality to derive a prediction from a score. It basically captures SVMModel and LogisticRegressionModel. - ProbabilisticClassificationModel: This trait defines the interface for models that return a calibrated confidence score (aka probability). *Misc* - some renaming - add test for probabilistic output -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
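For concreteness, a rough sketch of how the two interfaces described above might be shaped; this is inferred from the description only, so the method names and the default implementation are my assumptions rather than the proposal's actual code:

{code}
import org.apache.spark.mllib.linalg.Vector

trait BinaryClassificationModel {
  def weights: Vector
  def threshold: Double  // mandatory rather than clearable, per the summary
  def predictScore(features: Vector): Double
  // Derive the class label from the score via the threshold.
  def predictClass(features: Vector): Double =
    if (predictScore(features) > threshold) 1.0 else 0.0
}

trait ProbabilisticClassificationModel extends BinaryClassificationModel {
  // A calibrated confidence score (aka probability).
  def predictProbability(features: Vector): Double
}
{code}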
[jira] [Commented] (SPARK-3480) Throws out Not a valid command 'yarn-alpha/scalastyle' in dev/scalastyle for sbt build tool during 'Running Scala style checks'
[ https://issues.apache.org/jira/browse/SPARK-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169472#comment-14169472 ] Sean Owen commented on SPARK-3480: -- Given the discussion I suggest this is CannotReproduce? Throws out Not a valid command 'yarn-alpha/scalastyle' in dev/scalastyle for sbt build tool during 'Running Scala style checks' --- Key: SPARK-3480 URL: https://issues.apache.org/jira/browse/SPARK-3480 Project: Spark Issue Type: Bug Components: Build Reporter: Yi Zhou Priority: Minor Symptom: Run ./dev/run-tests and dump outputs as following: SBT_MAVEN_PROFILES_ARGS=-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl [Warn] Java 8 tests will not run because JDK version is 1.8. = Running Apache RAT checks = RAT checks passed. = Running Scala style checks = Scalastyle checks failed at following occurrences: [error] Expected ID character [error] Not a valid command: yarn-alpha [error] Expected project ID [error] Expected configuration [error] Expected ':' (if selecting a configuration) [error] Expected key [error] Not a valid key: yarn-alpha [error] yarn-alpha/scalastyle [error] ^ Possible Cause: I checked the dev/scalastyle, found that there are 2 parameters 'yarn-alpha/scalastyle' and 'yarn/scalastyle' separately,like echo -e q\n | sbt/sbt -Pyarn -Phadoop-0.23 -Dhadoop.version=0.23.9 yarn-alpha/scalastyle \ scalastyle.txt echo -e q\n | sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 yarn/scalastyle \ scalastyle.txt From above error message, sbt seems to complain them due to '/' separator. So it can be run through after I manually modified original ones to 'yarn-alpha:scalastyle' and 'yarn:scalastyle'.. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3924) Upgrade to Akka version 2.3.6
[ https://issues.apache.org/jira/browse/SPARK-3924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169504#comment-14169504 ] Sean Owen commented on SPARK-3924: -- I think this is a duplicate of SPARK-2707 and SPARK-2805. Upgrade to Akka version 2.3.6 - Key: SPARK-3924 URL: https://issues.apache.org/jira/browse/SPARK-3924 Project: Spark Issue Type: Dependency upgrade Environment: deploy env Reporter: Helena Edelson I tried every sbt in the book but can't use the latest Akka version in my project with Spark. It would be great if I could. Also I can not use the latest Typesafe Config - 1.2.1, which would also be great. See https://issues.apache.org/jira/browse/SPARK-2593 This is a big change. If I have time I can do a PR. [~helena_e] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3
[ https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169509#comment-14169509 ] Sean Owen commented on SPARK-2707: -- Can this be considered a duplicate of SPARK-2805, since that's where I see recent action? Upgrade to Akka 2.3 --- Key: SPARK-2707 URL: https://issues.apache.org/jira/browse/SPARK-2707 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0 Reporter: Yardena Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray features directly in the same project. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1834) NoSuchMethodError when invoking JavaPairRDD.reduce() in Java
[ https://issues.apache.org/jira/browse/SPARK-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1834. -- Resolution: Duplicate On another look, I'm almost sure this is the same issue as in SPARK-3266, which [~joshrosen] has been looking at. NoSuchMethodError when invoking JavaPairRDD.reduce() in Java Key: SPARK-1834 URL: https://issues.apache.org/jira/browse/SPARK-1834 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1 Environment: Redhat Linux, Java 7, Hadoop 2.2, Scala 2.10.4 Reporter: John Snodgrass I get a java.lang.NoSuchMethodError when invoking JavaPairRDD.reduce(). Here is the partial stack trace: Exception in thread "main" java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:39) at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2; at JavaPairRDDReduceTest.main(JavaPairRDDReduceTest.java:49)... I'm using Spark 0.9.1. I checked to ensure that I'm compiling with the same version of Spark as I am running on the cluster. The reduce() method works fine with JavaRDD, just not with JavaPairRDD. Here is a code snippet that exhibits the problem: ArrayList<Integer> array = new ArrayList(); for (int i = 0; i < 10; ++i) { array.add(i); } JavaRDD<Integer> rdd = javaSparkContext.parallelize(array); JavaPairRDD<String, Integer> testRDD = rdd.map(new PairFunction<Integer, String, Integer>() { @Override public Tuple2<String, Integer> call(Integer t) throws Exception { return new Tuple2("" + t, t); } }).cache(); testRDD.reduce(new Function2<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() { @Override public Tuple2<String, Integer> call(Tuple2<String, Integer> arg0, Tuple2<String, Integer> arg1) throws Exception { return new Tuple2(arg0._1 + arg1._1, arg0._2 * 10 + arg0._2); } }); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2493) SBT gen-idea doesn't generate correct Intellij project
[ https://issues.apache.org/jira/browse/SPARK-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169527#comment-14169527 ] Sean Owen commented on SPARK-2493: -- Is this still an issue [~dbtsai] ? For IntelliJ, I find it much easier to point directly at the Maven build, and that's more the primary build system now anyway. SBT gen-idea doesn't generate correct Intellij project -- Key: SPARK-2493 URL: https://issues.apache.org/jira/browse/SPARK-2493 Project: Spark Issue Type: Sub-task Components: Build Reporter: DB Tsai I've a clean clone of spark master repository, and I generated the intellij project file by sbt gen-idea as usual. There are two issues we have after merging SPARK-1776 (read dependencies from Maven). 1) After SPARK-1776, sbt gen-idea will download the dependencies from internet even those jars are in local cache. Before merging, the second time we run gen-idea will not download anything but use the jars in cache. 2) The tests with spark local context can not be run in the intellij. It will show the following exception. The current workaround we've are checking out any snapshot before merging to gen-idea, and then switch back to current master. But this will not work when the master deviate too much from the latest working snapshot. [ERROR] [07/14/2014 16:27:49.967] [ScalaTest-run] [Remoting] Remoting error: [Startup timed out] [ akka.remote.RemoteTransportException: Startup timed out at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129) at akka.remote.Remoting.start(Remoting.scala:191) at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579) at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577) at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588) at akka.actor.ActorSystem$.apply(ActorSystem.scala:111) at akka.actor.ActorSystem$.apply(ActorSystem.scala:104) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:104) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:153) at org.apache.spark.SparkContext.init(SparkContext.scala:202) at org.apache.spark.SparkContext.init(SparkContext.scala:117) at org.apache.spark.SparkContext.init(SparkContext.scala:132) at org.apache.spark.mllib.util.LocalSparkContext$class.beforeAll(LocalSparkContext.scala:29) at org.apache.spark.mllib.optimization.LBFGSSuite.beforeAll(LBFGSSuite.scala:27) at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187) at org.apache.spark.mllib.optimization.LBFGSSuite.beforeAll(LBFGSSuite.scala:27) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) at org.apache.spark.mllib.optimization.LBFGSSuite.run(LBFGSSuite.scala:27) at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55) at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563) at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557) at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044) at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1043) at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:2722) at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1043) at 
org.scalatest.tools.Runner$.run(Runner.scala:883) at org.scalatest.tools.Runner.run(Runner.scala) at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:141) at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at akka.remote.Remoting.start(Remoting.scala:173) ... 35 more ] An exception or error caused a
[jira] [Resolved] (SPARK-2198) Partition the scala build file so that it is easier to maintain
[ https://issues.apache.org/jira/browse/SPARK-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2198. -- Resolution: Won't Fix Sounds like a WontFix Partition the scala build file so that it is easier to maintain --- Key: SPARK-2198 URL: https://issues.apache.org/jira/browse/SPARK-2198 Project: Spark Issue Type: Task Components: Build Reporter: Helena Edelson Priority: Minor Original Estimate: 3h Remaining Estimate: 3h Partition to standard Dependencies, Version, Settings, Publish.scala. keeping the SparkBuild clean to describe the modules and their deps so that changes in versions, for example, need only be made in Version.scala, settings changes such as in scalac in Settings.scala, etc. I'd be happy to do this ([~helena_e]) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be fixed
[ https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169549#comment-14169549 ] Sean Owen commented on SPARK-1849: -- Yes, I think there isn't a 'fix' here short of a quite different implementation. Hadoop's text support pretty deeply assumes UTF-8 (partly for speed) and the Spark implementation is just Hadoop's. This would have to justify rewriting all that. I think you have to treat this as binary data for now. Broken UTF-8 encoded data gets character replacements and thus can't be fixed --- Key: SPARK-1849 URL: https://issues.apache.org/jira/browse/SPARK-1849 Project: Spark Issue Type: Bug Reporter: Harry Brundage Attachments: encoding_test I'm trying to process a file which isn't valid UTF-8 data inside hadoop using Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug that we should fix? It looks like {{HadoopRDD}} uses {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which I believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement character, \uFFFD. Some example code mimicking what {{sc.textFile}} does underneath: {code} scala> sc.textFile(path).collect()(0) res8: String = ?pple scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes() res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101) scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.getBytes).collect()(0) res10: Array[Byte] = Array(-60, 112, 112, 108, 101) {code} In the above example, the first two snippets show the string representation and byte representation of the example line of text. The string shows a question mark for the replacement character and the bytes reveal the replacement character has been swapped in by {{Text.toString}}. The third snippet shows what happens if you call {{getBytes}} on the {{Text}} object which comes back from hadoop land: we get the real bytes in the file out. Now, I think this is a bug, though you may disagree. The text inside my file is perfectly valid iso-8859-1 encoded bytes, which I would like to be able to rescue and re-encode into UTF-8, because I want my application to be smart like that. I think Spark should give me the raw broken string so I can re-encode, but I can't get at the original bytes in order to guess at what the source encoding might be, as they have already been replaced. I'm dealing with data from some CDN access logs which are, to put it nicely, diversely encoded, but I think this is a use case Spark should fully support. So, my suggested fix, on which I'd like some guidance, is to change {{textFile}} to spit out broken strings by not using {{Text}}'s UTF-8 encoding. Further compounding this issue is that my application is actually in PySpark, but we can talk about how bytes fly through to Scala land after this if we agree that this is an issue at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
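Picking up the "treat it as binary data" suggestion, one workable pattern today is a variation on the {{hadoopFile}} calls quoted above: keep the raw bytes and decode them yourself. A sketch, where the ISO-8859-1 charset is just this example's assumption about the data:

{code}
import java.nio.charset.Charset
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Copy only the valid region of the (reused) Text buffer, then decode with
// whatever charset the data is actually in, instead of Text.toString's UTF-8.
val lines = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, text) =>
    val bytes = java.util.Arrays.copyOf(text.getBytes, text.getLength)
    new String(bytes, Charset.forName("ISO-8859-1"))
  }
{code}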
[jira] [Resolved] (SPARK-1787) Build failure on JDK8 :: SBT fails to load build configuration file
[ https://issues.apache.org/jira/browse/SPARK-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1787. -- Resolution: Duplicate FWIW SBT + Java 8 has worked fine for me on master for a long while, so assume this does not affect 1.1 or perhaps 1.0. Build failure on JDK8 :: SBT fails to load build configuration file --- Key: SPARK-1787 URL: https://issues.apache.org/jira/browse/SPARK-1787 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 0.9.0 Environment: JDK8 Scala 2.10.X SBT 0.12.X Reporter: Richard Gomes Priority: Minor SBT fails to build under JDK8. Please find steps to reproduce the error below: (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ uname -a Linux terra 3.13-1-amd64 #1 SMP Debian 3.13.10-1 (2014-04-15) x86_64 GNU/Linux (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ java -version java version 1.8.0_05 Java(TM) SE Runtime Environment (build 1.8.0_05-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode) (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ scala -version Scala code runner version 2.10.3 -- Copyright 2002-2013, LAMP/EPFL (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ sbt/sbt clean Launching sbt from sbt/sbt-launch-0.12.4.jar Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=350m; support was removed in 8.0 [info] Loading project definition from /home/rgomes/workspace/spark-0.9.1/project/project [info] Compiling 1 Scala source to /home/rgomes/workspace/spark-0.9.1/project/project/target/scala-2.9.2/sbt-0.12/classes... [error] error while loading CharSequence, class file '/opt/developer/jdk1.8.0_05/jre/lib/rt.jar(java/lang/CharSequence.class)' is broken [error] (bad constant pool tag 15 at byte 1501) [error] error while loading Comparator, class file '/opt/developer/jdk1.8.0_05/jre/lib/rt.jar(java/util/Comparator.class)' is broken [error] (bad constant pool tag 15 at byte 5003) [error] two errors found [error] (compile:compile) Compilation failed Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? q -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1738) Is spark-debugger still available?
[ https://issues.apache.org/jira/browse/SPARK-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1738. -- Resolution: Fixed That document has since been deleted anyway, and I assume the answer is that it no longer exists. Is spark-debugger still available? -- Key: SPARK-1738 URL: https://issues.apache.org/jira/browse/SPARK-1738 Project: Spark Issue Type: Question Components: Documentation Reporter: WangTaoTheTonic Priority: Minor I see the arthur branch (https://github.com/apache/spark/tree/arthur) described in docs/spark-debugger.md does not exist. So is the spark-debugger still available? If not, should the document be deleted? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1605) Improve mllib.linalg.Vector
[ https://issues.apache.org/jira/browse/SPARK-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1605. -- Resolution: Won't Fix Another WontFix then? Improve mllib.linalg.Vector --- Key: SPARK-1605 URL: https://issues.apache.org/jira/browse/SPARK-1605 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sandeep Singh We can make current Vector a wrapper around Breeze.linalg.Vector ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1573) slight modification with regards to sbt/sbt test
[ https://issues.apache.org/jira/browse/SPARK-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1573. -- Resolution: Won't Fix This has been resolved insofar as the main README.md no longer has this text. slight modification with regards to sbt/sbt test Key: SPARK-1573 URL: https://issues.apache.org/jira/browse/SPARK-1573 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Nishkam Ravi When the sources are built against a certain Hadoop version with SPARK_YARN=true, the same settings seem necessary when running sbt/sbt test. For example: SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0 SPARK_YARN=true sbt/sbt assembly SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0 SPARK_YARN=true sbt/sbt test Otherwise build errors and failing tests are seen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1479) building spark on 2.0.0-cdh4.4.0 failed
[ https://issues.apache.org/jira/browse/SPARK-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1479. -- Resolution: Won't Fix Given discussion in SPARK-3445, I doubt anything more will be done for YARN alpha support, as it's on its way out. building spark on 2.0.0-cdh4.4.0 failed --- Key: SPARK-1479 URL: https://issues.apache.org/jira/browse/SPARK-1479 Project: Spark Issue Type: Question Environment: 2.0.0-cdh4.4.0 Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL spark 0.9.1 java version 1.6.0_32 Reporter: jackielihf Attachments: mvn.log [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile (scala-compile-first) on project spark-yarn-alpha_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed. CompileFailed - [Help 1] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile (scala-compile-first) on project spark-yarn-alpha_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed. at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:225) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59) at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183) at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161) at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320) at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156) at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537) at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196) at org.apache.maven.cli.MavenCli.main(MavenCli.java:141) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290) at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230) at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409) at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352) Caused by: org.apache.maven.plugin.PluginExecutionException: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed. at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:110) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209) ... 
19 more Caused by: Compilation failed at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:76) at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:35) at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:29) at sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply$mcV$sp(AggressiveCompile.scala:71) at sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:71) at sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:71) at sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:101) at sbt.compiler.AggressiveCompile$$anonfun$4.compileScala$1(AggressiveCompile.scala:70) at sbt.compiler.AggressiveCompile$$anonfun$4.apply(AggressiveCompile.scala:88) at sbt.compiler.AggressiveCompile$$anonfun$4.apply(AggressiveCompile.scala:60) at sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:24) at sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:22) at sbt.inc.Incremental$.cycle(Incremental.scala:40) at sbt.inc.Incremental$.compile(Incremental.scala:25) at sbt.inc.IncrementalCompile$.apply(Compile.scala:20) at sbt.compiler.AggressiveCompile.compile2(AggressiveCompile.scala:96) at
[jira] [Commented] (SPARK-1409) Flaky Test: actor input stream test in org.apache.spark.streaming.InputStreamsSuite
[ https://issues.apache.org/jira/browse/SPARK-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169589#comment-14169589 ] Sean Owen commented on SPARK-1409: -- Since this test was removed with SPARK-2805, safe to call this closed? Flaky Test: actor input stream test in org.apache.spark.streaming.InputStreamsSuite - Key: SPARK-1409 URL: https://issues.apache.org/jira/browse/SPARK-1409 Project: Spark Issue Type: Bug Components: Streaming Reporter: Michael Armbrust Assignee: Tathagata Das Here are just a few cases: https://travis-ci.org/apache/spark/jobs/22151827 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13709/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1398) Remove FindBugs jsr305 dependency
[ https://issues.apache.org/jira/browse/SPARK-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1398. -- Resolution: Won't Fix From the PR discussion, this had to be reverted because of some build problems, so I assume removing this .jar is a WontFix Remove FindBugs jsr305 dependency - Key: SPARK-1398 URL: https://issues.apache.org/jira/browse/SPARK-1398 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Mark Hamstra Assignee: Mark Hamstra Priority: Minor We're not making much use of FindBugs at this point, but findbugs-2.0.x is a drop-in replacement for 1.3.9 and does offer significant improvements (http://findbugs.sourceforge.net/findbugs2.html), so it's probably where we want to be for Spark 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1339) Build error: org.eclipse.paho:mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1339. -- Resolution: Not a Problem Build error: org.eclipse.paho:mqtt-client - Key: SPARK-1339 URL: https://issues.apache.org/jira/browse/SPARK-1339 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.0 Reporter: Ken Williams Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. The Maven error is: {code} [ERROR] Failed to execute goal on project spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus {code} My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4. Is there an additional Maven repository I should add or something? If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, but I would really like to get the examples working because I haven't played with Spark before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1317) sbt doesn't work for building Spark programs
[ https://issues.apache.org/jira/browse/SPARK-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169629#comment-14169629 ] Sean Owen commented on SPARK-1317: -- PS if you're still interested in this, I am pretty sure #1 is the correct answer. I would use my own sbt (or really, the SBT support in my IDE perhaps, or Maven) to build my own app. sbt doesn't work for building Spark programs Key: SPARK-1317 URL: https://issues.apache.org/jira/browse/SPARK-1317 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 0.9.0 Reporter: Diana Carroll I don't know if this is a doc bug or a product bug, because I don't know how it is supposed to work. The Spark quick start guide page has a section that walks you through creating a standalone Spark app in Scala. I think the instructions worked in 0.8.1 but I can't get them to work in 0.9.0. The instructions have you create a directory structure in the canonical sbt format, but do not tell you where to locate this directory. However, after setting up the structure, the tutorial then instructs you to use the command {code}sbt/sbt package{code} which implies that the working directory must be SPARK_HOME. I tried it both ways: creating a mysparkapp directory right in SPARK_HOME and creating it in my home directory. Neither worked, with different results: - if I create a mysparkapp directory as instructed in SPARK_HOME, cd to SPARK_HOME and run the command sbt/sbt package as specified, it packages ALL of Spark...but does not build my own app. - if I create a mysparkapp directory elsewhere, cd to that directory, and run the command there, I get an error: {code} $SPARK_HOME/sbt/sbt package awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for reading (No such file or directory) Attempting to fetch sbt /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or directory /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or directory Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please install sbt manually from http://www.scala-sbt.org/ {code} So, either: 1: the Spark distribution of sbt can only be used to build Spark itself, not your own code...in which case the quick start guide is wrong, and should instead say that users should install sbt separately OR 2: the Spark distribution of sbt CAN be used, with proper configuration, in which case that configuration should be documented (I wasn't able to figure it out, but I didn't try that hard either) OR 3: the Spark distribution of sbt is *supposed* to be able to build Spark apps, but is configured incorrectly in the product, in which case there's a product bug rather than a doc bug Although this is not a show-stopper, because the obvious workaround is to simply install sbt separately, I think at least updating the docs is pretty high priority, because most people learning Spark start with that Quick Start page, which doesn't work. (If it's doc issue #1, let me know, and I'll fix the docs myself. :-) ) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
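For what it's worth, going with option #1 and a separately installed sbt needs nothing more than a small build file in the app directory. A minimal sketch for the 0.9.x line described in the issue (the exact name, versions, and the Akka resolver are assumptions drawn from my recollection of the 0.9 quick start; adjust as needed):

{code}
// mysparkapp/build.sbt -- built with a standalone sbt install, not Spark's sbt/sbt
name := "My Spark App"

version := "0.1"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
{code}

With that in place, running plain {{sbt package}} from the app directory produces the application jar, independent of SPARK_HOME.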
[jira] [Resolved] (SPARK-1243) spark compilation error
[ https://issues.apache.org/jira/browse/SPARK-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1243. -- Resolution: Fixed This appears to be long since resolved by something else, perhaps a subsequent change to Jetty deps. I have never seen this personally, and Jenkins builds are fine. spark compilation error --- Key: SPARK-1243 URL: https://issues.apache.org/jira/browse/SPARK-1243 Project: Spark Issue Type: Bug Components: Build Reporter: Qiuzhuang Lian After issuing git pull from git master, spark could not compile any longer Here is the error message, it seems that it is related to jetty upgrade.@rxin compile [info] Compiling 301 Scala sources and 19 Java sources to E:\projects\amplab\spark\core\target\scala-2.10\classes... [warn] Class java.nio.channels.ReadPendingException not found - continuing with a stub. [error] [error] while compiling: E:\projects\amplab\spark\core\src\main\scala\org\apache\spark\HttpServer.scala [error] during phase: erasure [error] library version: version 2.10.3 [error] compiler version: version 2.10.3 [error] reconstructed args: -Xmax-classfile-name 120 -deprecation -bootclasspath C:\Java\jdk1.6.0_27\jre\lib\resources.jar;C:\Java\jdk1.6.0_27\jre\lib\rt.jar;C:\Java\jdk1.6.0_27\jre\lib\sunrsasign.jar;C:\Java\jdk1.6.0_27\jre\lib\jsse.jar;C:\Java\jdk1.6.0_27\jre\lib\jce.jar;C:\Java\jdk1.6.0_27\jre\lib\charsets.jar;C:\Java\jdk1.6.0_27\jre\lib\modules\jdk.boot.jar;C:\Java\jdk1.6.0_27\jre\classes;C:\Users\Kand\.sbt\boot\scala-2.10.3\lib\scala-library.jar -unchecked -classpath
[jira] [Resolved] (SPARK-1306) no instructions provided for sbt assembly with Hadoop 2.2
[ https://issues.apache.org/jira/browse/SPARK-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1306. -- Resolution: Fixed I think this was obviated by subsequent changes to this documentation. SBT is no longer the focus, but building-spark.md now has more comprehensive documentation on building with YARN, including these recent versions. no instructions provided for sbt assembly with Hadoop 2.2 - Key: SPARK-1306 URL: https://issues.apache.org/jira/browse/SPARK-1306 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 0.9.0 Reporter: Diana Carroll On the running-on-yarn.html page, in the section "Building a YARN-Enabled Assembly JAR", only the instructions for building for old Hadoop (2.0.5) are provided. There's a comment that "The build process now also supports new YARN versions (2.2.x). See below." However, the only mention below is a single sentence which says "See Building Spark with Maven for instructions on how to build Spark using the Maven process." There are no instructions for building with sbt. This is different from prior versions of the docs, in which a whole paragraph was provided. I'd like to see the command line to build for Hadoop 2.2 included right at the top of the page. Also remove the bit about how it is now supported. Hadoop 2.2 is now the norm, no longer an exception, as I see it. Unfortunately I'm not sure exactly what the command should be. I tried this, but got errors: SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1234) clean up typos and grammar issues in Spark on YARN page
[ https://issues.apache.org/jira/browse/SPARK-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1234. -- Resolution: Won't Fix Given the discussion in https://github.com/apache/spark/pull/130 , this was abandoned, but I also don't see the bad text on that page anymore anyhow. It probably got improved in a subsequent update. clean up typos and grammar issues in Spark on YARN page --- Key: SPARK-1234 URL: https://issues.apache.org/jira/browse/SPARK-1234 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 0.9.0 Reporter: Diana Carroll Priority: Minor The "Launch spark application with yarn-client mode" section of this page has several incomplete sentences, typos, etc.: http://spark.incubator.apache.org/docs/latest/running-on-yarn.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1192) Around 30 parameters in Spark are used but undocumented and some are having confusing name
[ https://issues.apache.org/jira/browse/SPARK-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169649#comment-14169649 ] Sean Owen commented on SPARK-1192: -- The PR is actually at https://github.com/apache/spark/pull/2312 and is misnamed. Is this still live though? Around 30 parameters in Spark are used but undocumented and some are having confusing name -- Key: SPARK-1192 URL: https://issues.apache.org/jira/browse/SPARK-1192 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.0.0 Reporter: Nan Zhu Assignee: Nan Zhu I grepped the code in the core component and found that around 30 parameters in the implementation are actually used but undocumented. By reading the source code, I found that some of them are actually very useful for the user. I suggest making a complete document on the parameters. Also, some parameters have confusing names: spark.shuffle.copier.threads - this parameter controls how many threads you will use when you start a Netty-based shuffle service, but from the name, we cannot get this information; spark.shuffle.sender.port - a similar problem to the above one: when you use a Netty-based shuffle receiver, you will have to set up a Netty-based sender...this parameter sets the port used by the Netty sender, but the name cannot convey this information -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1149) Bad partitioners can cause Spark to hang
[ https://issues.apache.org/jira/browse/SPARK-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1149. -- Resolution: Fixed Looks like Patrick merged this into master in March. It might have been fixed for ... 1.0? Bad partitioners can cause Spark to hang Key: SPARK-1149 URL: https://issues.apache.org/jira/browse/SPARK-1149 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Bryn Keller Priority: Minor While implementing a unit test for lookup, I accidentally created a situation where a partitioner returned a partition number that was outside its range. It should have returned 0 or 1, but in the last case, it returned a -1. Rather than reporting the problem via an exception, Spark simply hangs during the unit test run. We should catch this bad behavior by partitioners and throw an exception. test("lookup with bad partitioner") { val pairs = sc.parallelize(Array((1,2), (3,4), (5,6), (5,7))) val p = new Partitioner { def numPartitions: Int = 2 def getPartition(key: Any): Int = key.hashCode() % 2 } val shuffled = pairs.partitionBy(p) assert(shuffled.partitioner === Some(p)) assert(shuffled.lookup(1) === Seq(2)) assert(shuffled.lookup(5) === Seq(6,7)) assert(shuffled.lookup(-1) === Seq()) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
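As a rough illustration of the check being asked for (a sketch of mine, not the change that was merged), the partition index returned by a user-supplied {{Partitioner}} could be validated before it is used, so a bad value fails fast instead of hanging the job:
{code}
import org.apache.spark.Partitioner

// Hypothetical helper: reject out-of-range partition indices up front.
def checkedPartition(p: Partitioner, key: Any): Int = {
  val part = p.getPartition(key)
  require(part >= 0 && part < p.numPartitions,
    s"Partitioner returned $part for key $key, outside [0, ${p.numPartitions})")
  part
}
{code}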
[jira] [Resolved] (SPARK-1083) Build fail
[ https://issues.apache.org/jira/browse/SPARK-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1083. -- Resolution: Cannot Reproduce This looks like a git error, and is ancient at this point. I presume that since we have evidence that Windows builds subsequently worked, this was either a local problem or fixed by something else. Build fail -- Key: SPARK-1083 URL: https://issues.apache.org/jira/browse/SPARK-1083 Project: Spark Issue Type: Bug Components: Build, Windows Affects Versions: 0.7.3 Reporter: Jan Paw Problem with building the latest version from github. {code:none}[info] Loading project definition from C:\Users\Jan\Documents\GitHub\incubator-s park\project\project [debug] [debug] Initial source changes: [debug] removed:Set() [debug] added: Set() [debug] modified: Set() [debug] Removed products: Set() [debug] Modified external sources: Set() [debug] Modified binary dependencies: Set() [debug] Initial directly invalidated sources: Set() [debug] [debug] Sources indirectly invalidated by: [debug] product: Set() [debug] binary dep: Set() [debug] external source: Set() [debug] All initially invalidated sources: Set() [debug] Copy resource mappings: [debug] java.lang.RuntimeException: Nonzero exit code (128): git clone https://github.co m/chenkelmann/junit_xml_listener.git C:\Users\Jan\.sbt\0.13\staging\5f76b43a3aca 87b5c013\junit_xml_listener at scala.sys.package$.error(package.scala:27) at sbt.Resolvers$.run(Resolvers.scala:134) at sbt.Resolvers$.run(Resolvers.scala:123) at sbt.Resolvers$$anon$2.clone(Resolvers.scala:78) at sbt.Resolvers$DistributedVCS$$anonfun$toResolver$1$$anonfun$apply$11$ $anonfun$apply$5.apply$mcV$sp(Resolvers.scala:104) at sbt.Resolvers$.creates(Resolvers.scala:141) at sbt.Resolvers$DistributedVCS$$anonfun$toResolver$1$$anonfun$apply$11. apply(Resolvers.scala:103) at sbt.Resolvers$DistributedVCS$$anonfun$toResolver$1$$anonfun$apply$11. 
apply(Resolvers.scala:103) at sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$3.apply(Bui ldLoader.scala:90) at sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$3.apply(Bui ldLoader.scala:89) at scala.Option.map(Option.scala:145) at sbt.BuildLoader$$anonfun$componentLoader$1.apply(BuildLoader.scala:89 ) at sbt.BuildLoader$$anonfun$componentLoader$1.apply(BuildLoader.scala:85 ) at sbt.MultiHandler.apply(BuildLoader.scala:16) at sbt.BuildLoader.apply(BuildLoader.scala:142) at sbt.Load$.loadAll(Load.scala:314) at sbt.Load$.loadURI(Load.scala:266) at sbt.Load$.load(Load.scala:262) at sbt.Load$.load(Load.scala:253) at sbt.Load$.apply(Load.scala:137) at sbt.Load$.buildPluginDefinition(Load.scala:597) at sbt.Load$.buildPlugins(Load.scala:563) at sbt.Load$.plugins(Load.scala:551) at sbt.Load$.loadUnit(Load.scala:412) at sbt.Load$$anonfun$15$$anonfun$apply$11.apply(Load.scala:258) at sbt.Load$$anonfun$15$$anonfun$apply$11.apply(Load.scala:258) at sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$ apply$5$$anonfun$apply$6.apply(BuildLoader.scala:93) at sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$ apply$5$$anonfun$apply$6.apply(BuildLoader.scala:92) at sbt.BuildLoader.apply(BuildLoader.scala:143) at sbt.Load$.loadAll(Load.scala:314) at sbt.Load$.loadURI(Load.scala:266) at sbt.Load$.load(Load.scala:262) at sbt.Load$.load(Load.scala:253) at sbt.Load$.apply(Load.scala:137) at sbt.Load$.defaultLoad(Load.scala:40) at sbt.BuiltinCommands$.doLoadProject(Main.scala:451) at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:445) at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:445) at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.sca la:60) at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.sca la:60) at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.sca la:62) at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.sca la:62) at sbt.Command$.process(Command.scala:95) at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:100) at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:100) at sbt.State$$anon$1.process(State.scala:179) at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:100) at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:100) at
[jira] [Resolved] (SPARK-1017) Set the permgen even if we are calling the users sbt (via SBT_OPTS)
[ https://issues.apache.org/jira/browse/SPARK-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1017. -- Resolution: Won't Fix As I understand, only {{sbt/sbt}} is supported for building Spark with SBT, rather than a local {{sbt}}. Maven is the primary build, and it sets {{MaxPermSize}} and {{PermGen}} for scalac and scalatest. I think this is obsolete and/or already covered then? Set the permgen even if we are calling the users sbt (via SBT_OPTS) --- Key: SPARK-1017 URL: https://issues.apache.org/jira/browse/SPARK-1017 Project: Spark Issue Type: Improvement Reporter: Patrick Cogan Assignee: Patrick Cogan Now we will call the user's sbt installation if they have one. But users might run into permgen issues... so we should force the permgen unless the user explicitly overrides it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1463) cleanup unnecessary dependency jars in the spark assembly jars
[ https://issues.apache.org/jira/browse/SPARK-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169709#comment-14169709 ] Sean Owen commented on SPARK-1463: -- FWIW I do not see these packages in the final assembly JAR anymore. This may be obsolete? cleanup unnecessary dependency jars in the spark assembly jars -- Key: SPARK-1463 URL: https://issues.apache.org/jira/browse/SPARK-1463 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 0.9.0 Reporter: Jenny MA Priority: Minor Labels: easyfix Fix For: 1.0.0 There are a couple of GPL/LGPL-based dependencies which are included in the final assembly jar, which are not used by the Spark runtime. I identified the following libraries. We can provide a fix in assembly/pom.xml: <exclude>com.google.code.findbugs:*</exclude> <exclude>org.acplt:oncrpc:*</exclude> <exclude>glassfish:*</exclude> <exclude>com.cenqua.clover:clover:*</exclude> <exclude>org.glassfish:*</exclude> <exclude>org.glassfish.grizzly:*</exclude> <exclude>org.glassfish.gmbal:*</exclude> <exclude>org.glassfish.external:*</exclude> -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1010) Update all unit tests to use SparkConf instead of system properties
[ https://issues.apache.org/jira/browse/SPARK-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169724#comment-14169724 ] Sean Owen commented on SPARK-1010: -- Yes, lots of usage in tests still. A lot looks intentional. {code} find . -name "*Suite.scala" -type f -exec grep -E "System\.[gs]etProperty" {} \; ... .format(System.getProperty("user.name", "unknown"), .format(System.getProperty("user.name", "unknown")).stripMargin System.setProperty("spark.testing", "true") System.setProperty("spark.reducer.maxMbInFlight", "1") System.setProperty("spark.storage.memoryFraction", "0.0001") System.setProperty("spark.storage.memoryFraction", "0.01") System.setProperty("spark.authenticate", "false") System.setProperty("spark.authenticate", "false") System.setProperty("spark.shuffle.manager", "hash") System.setProperty("spark.scheduler.mode", "FIFO") System.setProperty("spark.scheduler.mode", "FAIR") ... {code} Update all unit tests to use SparkConf instead of system properties --- Key: SPARK-1010 URL: https://issues.apache.org/jira/browse/SPARK-1010 Project: Spark Issue Type: New Feature Affects Versions: 0.9.0 Reporter: Patrick Cogan Assignee: Nirmal Priority: Minor Labels: starter -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
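For comparison, the SparkConf-based equivalent of the {{System.setProperty}} calls above would look something like this (a sketch; the property names are copied from the grep output, and whether a given test can switch depends on when the setting is read):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Configuration scoped to one test's SparkContext instead of JVM-wide system properties.
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("test")
  .set("spark.testing", "true")
  .set("spark.scheduler.mode", "FIFO")
val sc = new SparkContext(conf)
{code}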
[jira] [Commented] (SPARK-1209) SparkHadoopUtil should not use package org.apache.hadoop
[ https://issues.apache.org/jira/browse/SPARK-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169835#comment-14169835 ] Sean Owen commented on SPARK-1209: -- Yes, I wonder too: do SparkHadoopMapRedUtil and SparkHadoopMapReduceUtil need to live in {{org.apache.hadoop}} anymore? I assume they may have in the past to access some package-private Hadoop code. But I've tried moving them under {{org.apache.spark}} and compiling against a few Hadoop versions and it all seems fine. Am I missing something or is this worth changing? It's private to Spark (well, org.apache right now by necessity), so I think it's fair game to move. See https://github.com/srowen/spark/tree/SPARK-1209 SparkHadoopUtil should not use package org.apache.hadoop Key: SPARK-1209 URL: https://issues.apache.org/jira/browse/SPARK-1209 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: Sandy Pérez González Assignee: Mark Grover It's private, so the change won't break compatibility -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3934) RandomForest bug in sanity check in DTStatsAggregator
[ https://issues.apache.org/jira/browse/SPARK-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170220#comment-14170220 ] Sean Owen commented on SPARK-3934: -- Yep that fixes the issue I was seeing. Thanks! I can confirm it did not affect DecisionTree too, so it seems to match your analysis. RandomForest bug in sanity check in DTStatsAggregator - Key: SPARK-3934 URL: https://issues.apache.org/jira/browse/SPARK-3934 Project: Spark Issue Type: Bug Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley When run with a mix of unordered categorical and continuous features, on multiclass classification, RandomForest fails. The bug is in the sanity checks in getFeatureOffset and getLeftRightFeatureOffsets, which use the wrong indices for checking whether features are unordered. Proposal: Remove the sanity checks since they are not really needed, and since they would require DTStatsAggregator to keep track of an extra set of indices (for the feature subset). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3895) Scala style: Indentation of method
[ https://issues.apache.org/jira/browse/SPARK-3895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170609#comment-14170609 ] Sean Owen commented on SPARK-3895: -- The rule is already in the style guide. There seems to be agreement not to change this all at once. The original PR seemed to contradict the style guide. What change are you proposing in order to reopen this? Scala style: Indentation of method -- Key: SPARK-3895 URL: https://issues.apache.org/jira/browse/SPARK-3895 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: sjk {code:title=core/src/main/scala/org/apache/spark/Aggregator.scala|borderStyle=solid} // for example def combineCombinersByKey(iter: Iterator[_ <: Product2[K, C]], context: TaskContext) : Iterator[(K, C)] = { ... def combineValuesByKey(iter: Iterator[_ <: Product2[K, V]], context: TaskContext): Iterator[(K, C)] = { {code} These do not conform to the rule: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide There is a lot of code like this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
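For what it's worth, a self-contained illustration of the declaration style the guide recommends (my own example, not code from Aggregator.scala): parameters on their own lines with a four-space indent, and the return type kept on the line that closes the parameter list.
{code}
def combineValuesByKey[K, V, C](
    iter: Iterator[(K, V)],
    createCombiner: V => C,
    mergeValue: (C, V) => C): Iterator[(K, C)] = {
  val combiners = scala.collection.mutable.Map.empty[K, C]
  for ((k, v) <- iter) {
    combiners(k) = combiners.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v))
  }
  combiners.iterator
}
{code}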
[jira] [Commented] (SPARK-1209) SparkHadoopUtil should not use package org.apache.hadoop
[ https://issues.apache.org/jira/browse/SPARK-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170617#comment-14170617 ] Sean Owen commented on SPARK-1209: -- Hm, it's {{private[apache]}} though. Couldn't this only be used by people writing in the {{org.apache}} namespace? Naturally a project might do just this to access this code, but I hadn't thought this was promised as a stable API. People who pull this trick can, I suppose, declare their hack in {{org.apache.spark}}, although that's a source change. I can set up a forwarder and deprecate it to see how that looks, but wanted to check if it's really these classes in question that are being used outside Spark. SparkHadoopUtil should not use package org.apache.hadoop Key: SPARK-1209 URL: https://issues.apache.org/jira/browse/SPARK-1209 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: Sandy Ryza Assignee: Mark Grover It's private, so the change won't break compatibility -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
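A sketch of how such a forwarder-plus-deprecation could look (all names here are hypothetical; this only illustrates the pattern, not the actual change): the trait's real home moves under org.apache.spark, and an empty, deprecated subtype stays behind in the old package so existing code keeps compiling with a warning.
{code}
// One source file, two packages; every name below is made up for illustration.
package org.apache.spark.mapred {
  trait HypotheticalMapRedUtil {
    def newTaskName(prefix: String, id: Int): String = prefix + "_" + id
  }
}

package org.apache.hadoop.mapred {
  @deprecated("Use org.apache.spark.mapred.HypotheticalMapRedUtil instead", "1.2.0")
  trait HypotheticalMapRedUtil extends org.apache.spark.mapred.HypotheticalMapRedUtil
}
{code}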
[jira] [Resolved] (SPARK-3480) Throws out Not a valid command 'yarn-alpha/scalastyle' in dev/scalastyle for sbt build tool during 'Running Scala style checks'
[ https://issues.apache.org/jira/browse/SPARK-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3480. -- Resolution: Cannot Reproduce Throws out Not a valid command 'yarn-alpha/scalastyle' in dev/scalastyle for sbt build tool during 'Running Scala style checks' --- Key: SPARK-3480 URL: https://issues.apache.org/jira/browse/SPARK-3480 Project: Spark Issue Type: Bug Components: Build Reporter: Yi Zhou Priority: Minor Symptom: Run ./dev/run-tests and it dumps output like the following: SBT_MAVEN_PROFILES_ARGS=-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl [Warn] Java 8 tests will not run because JDK version is 1.8. = Running Apache RAT checks = RAT checks passed. = Running Scala style checks = Scalastyle checks failed at following occurrences: [error] Expected ID character [error] Not a valid command: yarn-alpha [error] Expected project ID [error] Expected configuration [error] Expected ':' (if selecting a configuration) [error] Expected key [error] Not a valid key: yarn-alpha [error] yarn-alpha/scalastyle [error] ^ Possible Cause: I checked dev/scalastyle and found that there are 2 parameters, 'yarn-alpha/scalastyle' and 'yarn/scalastyle', like echo -e "q\n" | sbt/sbt -Pyarn -Phadoop-0.23 -Dhadoop.version=0.23.9 yarn-alpha/scalastyle \ >> scalastyle.txt echo -e "q\n" | sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 yarn/scalastyle \ >> scalastyle.txt From the above error message, sbt seems to complain about them due to the '/' separator. It runs through after I manually modify the original ones to 'yarn-alpha:scalastyle' and 'yarn:scalastyle'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3926) result of JavaRDD collectAsMap() is not serializable
[ https://issues.apache.org/jira/browse/SPARK-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171418#comment-14171418 ] Sean Owen commented on SPARK-3926: -- Oops, embarrassed to say I didn't realize {{WrapAsJava.scala}} is a Scala class. Can't change that. This requires subclassing {{MapWrapper}} to add {{java.io.Serializable}}. It still basically seems worthwhile to support this use case, so I'll propose it as a PR. result of JavaRDD collectAsMap() is not serializable Key: SPARK-3926 URL: https://issues.apache.org/jira/browse/SPARK-3926 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.1.0 Environment: CentOS / Spark 1.1 / Hadoop Hortonworks 2.4.0.2.1.2.0-402 Reporter: Antoine Amend Using the Java API, I want to collect the result of a RDDString, String as a HashMap using collectAsMap function: MapString, String map = myJavaRDD.collectAsMap(); This works fine, but when passing this map to another function, such as... myOtherJavaRDD.mapToPair(new CustomFunction(map)) ...this leads to the following error: Exception in thread main org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.map(RDD.scala:270) at org.apache.spark.api.java.JavaRDDLike$class.mapToPair(JavaRDDLike.scala:99) at org.apache.spark.api.java.JavaPairRDD.mapToPair(JavaPairRDD.scala:44) ../.. MY CLASS ../.. at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.NotSerializableException: scala.collection.convert.Wrappers$MapWrapper at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) This seems to be due to WrapAsJava.scala being non serializable ../.. implicit def mapAsJavaMap[A, B](m: Map[A, B]): ju.Map[A, B] = m match { //case JConcurrentMapWrapper(wrapped) = wrapped case JMapWrapper(wrapped) = wrapped.asInstanceOf[ju.Map[A, B]] case _ = new MapWrapper(m) } ../.. 
The workaround is to manually wrap this map into another, serializable one: Map<String, String> map = myJavaRDD.collectAsMap(); Map<String, String> tmp = new HashMap<String, String>(map); myOtherJavaRDD.mapToPair(new CustomFunction(tmp)) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
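A sketch of the subclassing idea from the comment above (a minimal version with made-up names, not the actual PR): a Java-facing map view that is also {{java.io.Serializable}}, which {{collectAsMap()}} could return instead of the plain Scala wrapper.
{code}
import java.{util => ju}

// Minimal serializable wrapper exposing a Scala Map through java.util.Map.
class SerializableMapWrapper[A, B](underlying: scala.collection.Map[A, B])
  extends ju.AbstractMap[A, B] with java.io.Serializable {

  override def entrySet(): ju.Set[ju.Map.Entry[A, B]] = {
    val set = new ju.LinkedHashSet[ju.Map.Entry[A, B]]()
    underlying.foreach { case (k, v) =>
      set.add(new ju.AbstractMap.SimpleImmutableEntry[A, B](k, v))
    }
    set
  }
}
{code}
Of course this only serializes usefully if the wrapped keys, values, and underlying Scala map are themselves serializable.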
[jira] [Commented] (SPARK-3955) Different versions between jackson-mapper-asl and jackson-core-asl
[ https://issues.apache.org/jira/browse/SPARK-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14172148#comment-14172148 ] Sean Owen commented on SPARK-3955: -- Looks like Jackson is managed to version 1.8.8 for Avro reasons. I think core just needs to be managed the same way. I'll try it locally to make sure that works. Different versions between jackson-mapper-asl and jackson-core-asl -- Key: SPARK-3955 URL: https://issues.apache.org/jira/browse/SPARK-3955 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.1.0 Reporter: Jongyoul Lee The parent pom.xml specifies a version of jackson-mapper-asl. This is used by sql/hive/pom.xml. When mvn assembly runs, however, the jackson-mapper-asl version is not the same as the jackson-core-asl version. This is because other libraries use several versions of jackson, so another version of jackson-core-asl is assembled. This problem is fixed simply if pom.xml specifies the jackson-core-asl version as well. If it's not set, version 1.9.11 is merged into assembly.jar and we cannot use the Jackson library properly. {code} [INFO] Including org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8 in the shaded jar. [INFO] Including org.codehaus.jackson:jackson-core-asl:jar:1.9.11 in the shaded jar. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1209) SparkHadoopUtil should not use package org.apache.hadoop
[ https://issues.apache.org/jira/browse/SPARK-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14172571#comment-14172571 ] Sean Owen commented on SPARK-1209: -- ... and why wouldn't you, that's the title of the JIRA, oops. It's not that class that moves or even changes actually, and yes it should not move. Let me fix the title and fix my PR too. Maybe that's a more palatable change. SparkHadoopUtil should not use package org.apache.hadoop Key: SPARK-1209 URL: https://issues.apache.org/jira/browse/SPARK-1209 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: Sandy Ryza Assignee: Mark Grover It's private, so the change won't break compatibility -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173618#comment-14173618 ] Sean Owen commented on SPARK-3431: -- Yes, that should be what scalatest does. It is a fork of an old surefire, so it only has very few options. This parallelization failed as above for a few reasons. I have not gotten surefire to run the Scala tests. Parallelize execution of tests -- Key: SPARK-3431 URL: https://issues.apache.org/jira/browse/SPARK-3431 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common strategy to cut test time down is to parallelize the execution of the tests. Doing that may in turn require some prerequisite changes to be made to how certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14175377#comment-14175377 ] Sean Owen commented on SPARK-2426: -- Regarding licensing, if the code is BSD licensed then it does not require an entry in NOTICE file (it's a Category A license), and entries shouldn't be added to NOTICE unless required. I believe that in this case we will need to reproduce the text of the license in LICENSE since it will not be included otherwise from a Maven artifact. So I suggest: don't change NOTICE, and move the license in LICENSE up to the section where other licenses are reproduced in full. It's a complex issue but this is my best understanding of the right thing to do. Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions
[ https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14175748#comment-14175748 ] Sean Owen commented on SPARK-3967: -- You guys should make PRs for these. I am also not sure if it's so necessary to download the file into a temp directory and move it... it may cause a copy instead of rename, and in fact does here, so it is not as if the file appears in the target dir atomically anyway. I'm not sure the code here cleans up the partially downloaded file in case of error, and that could leave a broken file in the target dir instead of just a temp dir. The change to not copy the file when identical looks sound; I bet you can avoid checking if it exists twice. Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions - Key: SPARK-3967 URL: https://issues.apache.org/jira/browse/SPARK-3967 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Christophe PRÉAUD Attachments: spark-1.1.0-utils-fetch.patch, spark-1.1.0-yarn_cluster_tmpdir.patch Spark applications fail from time to time in yarn-cluster mode (but not in yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is set to a comma-separated list of directories which are located on different disks/partitions. Steps to reproduce: 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of directories located on different partitions (the more you set, the more likely it will be to reproduce the bug): (...) <property> <name>yarn.nodemanager.local-dirs</name> <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value> </property> (...) 2. Launch (several times) an application in yarn-cluster mode; it will fail (apparently randomly) from time to time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
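On the "avoid checking if it exists twice" point, the idea is something like the following (a sketch using Guava's file utilities, with hypothetical helper names; it is not the contents of the attached patches):
{code}
import java.io.File
import com.google.common.io.Files

// Copy source over target only when the target is missing or differs,
// and stat/compare the target a single time.
def copyIfNeeded(source: File, target: File): Unit = {
  val alreadyIdentical =
    target.isFile && target.length() == source.length() && Files.equal(source, target)
  if (!alreadyIdentical) {
    Files.copy(source, target)
  }
}
{code}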
[jira] [Commented] (SPARK-4007) EOF exception to load an JavaRDD from HDFS
[ https://issues.apache.org/jira/browse/SPARK-4007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176824#comment-14176824 ] Sean Owen commented on SPARK-4007: -- Is this going to have any information associated to it? JavaRDD works fine with HDFS. EOF exception to load an JavaRDD from HDFS --- Key: SPARK-4007 URL: https://issues.apache.org/jira/browse/SPARK-4007 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: hadoop-client-2.30 hadoop-hdfs-2.30 spark-core-1.10 spark-mllib-1.10 Reporter: Cristian Galán -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4002) JavaKafkaStreamSuite.testKafkaStream fails on OSX
[ https://issues.apache.org/jira/browse/SPARK-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176995#comment-14176995 ] Sean Owen commented on SPARK-4002: -- FWIW It doesn't fail for me from master right now if I run dev/run-tests. I'm on OS X Yosemite now (10.10) JavaKafkaStreamSuite.testKafkaStream fails on OSX - Key: SPARK-4002 URL: https://issues.apache.org/jira/browse/SPARK-4002 Project: Spark Issue Type: Bug Environment: Mac OSX 10.9.5. Reporter: Ryan Williams [~sowen] mentioned this on spark-dev [here|http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccamassdjs0fmsdc-k-4orgbhbfz2vvrmm0hfyifeeal-spft...@mail.gmail.com%3E] and I just reproduced it on {{master}} ([7e63bb4|https://github.com/apache/spark/commit/7e63bb49c526c3f872619ae14e4b5273f4c535e9]). The relevant output I get when running {{./dev/run-tests}} is: {code} [info] KafkaStreamSuite: [info] - Kafka input stream [info] Test run started [info] Test org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream started [error] Test org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream failed: junit.framework.AssertionFailedError: expected:3 but was:0 [error] at junit.framework.Assert.fail(Assert.java:50) [error] at junit.framework.Assert.failNotEquals(Assert.java:287) [error] at junit.framework.Assert.assertEquals(Assert.java:67) [error] at junit.framework.Assert.assertEquals(Assert.java:199) [error] at junit.framework.Assert.assertEquals(Assert.java:205) [error] at org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream(JavaKafkaStreamSuite.java:129) [error] ... [info] Test run finished: 1 failed, 0 ignored, 1 total, 19.798s {code} Seems like this test should be {{@Ignore}}'d, or some note about this made on the {{README.md}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4018) RDD.reduce failing with java.lang.ClassCastException: org.apache.spark.SparkContext$$anonfun$26 cannot be cast to scala.Function2
[ https://issues.apache.org/jira/browse/SPARK-4018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177592#comment-14177592 ] Sean Owen commented on SPARK-4018: -- Your sample code is Java, but the error seems to concern the Scala API. Are you sure the exception occurs on this invocation? Does it compile and then fail at runtime? or are you operating just in the shell? RDD.reduce failing with java.lang.ClassCastException: org.apache.spark.SparkContext$$anonfun$26 cannot be cast to scala.Function2 - Key: SPARK-4018 URL: https://issues.apache.org/jira/browse/SPARK-4018 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1 Reporter: Haithem Turki Priority: Critical Hey all, A simple reduce operation against Spark 1.1.1 is giving me following exception: {code} 14/10/20 16:27:22 ERROR executor.Executor: Exception in task 9.7 in stage 0.0 (TID 1001) java.lang.ClassCastException: org.apache.spark.SparkContext$$anonfun$26 cannot be cast to scala.Function2 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} My code is a relatively simple map-reduce: {code} MapString, Foo aggregateTracker = rdd.map(new MapFunction(list)) .reduce(new ReduceFunction()); {code} Where: - MapFunction is of type FunctionRecord, MapString, Object - ReduceFunction is of type Function2MapString, Foo, MapString, Foo, MapString, Foo - list is just a list of Foo2 Both Foo1 and Foo2 are serializable I've tried this with both the Java and Scala API, lines for each are: org.apache.spark.api.java.JavaRDD.reduce(JavaRDD.scala:32) org.apache.spark.rdd.RDD.reduce(RDD.scala:861) The thing being flagged is always: org.apache.spark.SparkContext$$anonfun$26 (the number doesn't change). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177707#comment-14177707 ] Sean Owen commented on SPARK-4001: -- FWIW I do perceive Apriori to be *the* basic frequent itemset algorithm. I think this is the original paper -- at least it was on Wikipedia and looks like the right time / author: http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf It is very simple, and probably what you'd cook up if you invented a solution to the problem: http://en.wikipedia.org/wiki/Apriori_algorithm Frequent itemset is not quite the same as a frequent item algorithm. From a bunch of sets of items, it tries to determine which subsets occur frequently. FP-Growth is the other itemset algorithm I have ever heard of. It's more sophisticated. I don't have a paper reference. If you're going to implement frequent itemsets, I think these are the two to start with. That said I perceive frequent itemsets to be kind of 90s and I have never had to use it myself. That is not to say they don't have use, and hey they're simple. I suppose my problem with this type of technique is that it's not really telling you whether the set occurred unusually frequently, just that it did in absolute terms. There is not a probabilistic element to these. Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
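To make the distinction concrete, here is a toy, single-machine sketch of Apriori over a handful of transactions (purely illustrative; not a proposed MLlib API and not distributed): frequent itemsets of size k are extended by one item at a time and kept only if their support clears the threshold.
{code}
// Toy Apriori: returns each frequent itemset with its support count.
def apriori(transactions: Seq[Set[String]], minSupport: Int): Map[Set[String], Int] = {
  def frequent(candidates: Set[Set[String]]): Map[Set[String], Int] =
    candidates.map(c => c -> transactions.count(t => c.subsetOf(t)))
      .filter(_._2 >= minSupport).toMap

  val items = transactions.flatten.toSet
  var current = frequent(items.map(Set(_)))
  var all = current
  while (current.nonEmpty) {
    val candidates = for (itemset <- current.keySet; i <- items if !itemset(i)) yield itemset + i
    current = frequent(candidates)
    all ++= current
  }
  all
}

// Example: {a, b} is frequent in 2 of 3 baskets.
println(apriori(Seq(Set("a", "b", "c"), Set("a", "b"), Set("b", "c")), minSupport = 2))
{code}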
[jira] [Commented] (SPARK-4021) Kinesis code can cause compile failures with newer JDK's
[ https://issues.apache.org/jira/browse/SPARK-4021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177725#comment-14177725 ] Sean Owen commented on SPARK-4021: -- This code is just fine though. The error is the kind you get from Java 5 when you @Override something that is not a superclass method. Here it's an interface method, which is perfectly fine in Java 6+. The other new warnings indicate that something turn on -Xlint:all but I don't see that in the build. It seems like a Jenkins config issue and I don't know that it's anything to do with Java 7, from this? 7u71 doesn't seem to have any compiler changes, and certainly wouldn't have any breaking like this. Are we sure Jenkins hasn't somehow located a copy of Java 5 installed somewhere? Kinesis code can cause compile failures with newer JDK's Key: SPARK-4021 URL: https://issues.apache.org/jira/browse/SPARK-4021 Project: Spark Issue Type: Bug Components: Streaming Environment: JDK 7u71 Reporter: Patrick Wendell When compiled with JDK7u71, the Spark build failed due to these issues: {code} [error] -- [error] 1. WARNING in /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java (at line 83) [error] private static final Logger logger = Logger.getLogger(JavaKinesisWordCountASL.class); [error] ^^ [error] The field JavaKinesisWordCountASL.logger is never read locally [error] -- [error] 2. WARNING in /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java (at line 151) [error] JavaDStreamString words = unionStreams.flatMap(new FlatMapFunctionbyte[], String() { [error] ^ [error] The serializable class does not declare a static final serialVersionUID field of type long [error] -- [error] 3. ERROR in /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java (at line 153) [error] public IterableString call(byte[] line) { [error] ^ [error] The method call(byte[]) of type new FlatMapFunctionbyte[],String(){} must override a superclass method [error] -- [error] 4. WARNING in /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java (at line 160) [error] new PairFunctionString, String, Integer() { [error] ^^^ [error] The serializable class does not declare a static final serialVersionUID field of type long [error] -- [error] 5. ERROR in /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java (at line 162) [error] public Tuple2String, Integer call(String s) { [error] ^^ [error] The method call(String) of type new PairFunctionString,String,Integer(){} must override a superclass method [error] -- [error] 6. 
WARNING in /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java (at line 165) [error] }).reduceByKey(new Function2Integer, Integer, Integer() { [error] ^^ [error] The serializable class does not declare a static final serialVersionUID field of type long [error] -- [error] 7. ERROR in /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java (at line 167) [error] public Integer call(Integer i1, Integer i2) { [error] [error] The method call(Integer, Integer) of type new Function2Integer,Integer,Integer(){} must override a superclass method [error] -- [error] 7 problems (3 errors, 4 warnings) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe,
[jira] [Commented] (SPARK-4022) Replace colt dependency (LGPL) with commons-math
[ https://issues.apache.org/jira/browse/SPARK-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177759#comment-14177759 ] Sean Owen commented on SPARK-4022: -- Yeah, looks like it is only in examples at least, so I don't know if Colt ever got technically distributed (? I forget whether it goes out with transitive deps). But best to change it. I can try it, since I know Commons Math well, unless someone's already on it. Replace colt dependency (LGPL) with commons-math Key: SPARK-4022 URL: https://issues.apache.org/jira/browse/SPARK-4022 Project: Spark Issue Type: Bug Components: MLlib Reporter: Patrick Wendell Priority: Critical The colt library we use is LGPL-licensed: http://acs.lbl.gov/ACSSoftware/colt/license.html We need to swap this out for commons-math. It is also a very old library that hasn't been updated since 2004. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4018) RDD.reduce failing with java.lang.ClassCastException: org.apache.spark.SparkContext$$anonfun$26 cannot be cast to scala.Function2
[ https://issues.apache.org/jira/browse/SPARK-4018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177766#comment-14177766 ] Sean Owen commented on SPARK-4018: -- Hm, I mean I suppose it goes without saying that the Java unit tests show that this does in general work, and I am successfully using JavaRDD all day myself with map and reduce. I think the anonymous function here is the context cleaner function perhaps, but I think that's a red herring. I noticed I don't see a org.apache.spark.SparkContext$$anonfun$26 in the byte code when built from master and this reminds me of a potential explanation. Are you building against a different version of Spark than you're running? anonymous function 26 could be something totally different at runtime if so. That would explain why you aren't seeing any problem at compile time. I imagine it's something like this and not a problem with Spark per se. RDD.reduce failing with java.lang.ClassCastException: org.apache.spark.SparkContext$$anonfun$26 cannot be cast to scala.Function2 - Key: SPARK-4018 URL: https://issues.apache.org/jira/browse/SPARK-4018 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1 Reporter: Haithem Turki Priority: Critical Hey all, A simple reduce operation against Spark 1.1.1 is giving me following exception: {code} 14/10/20 16:27:22 ERROR executor.Executor: Exception in task 9.7 in stage 0.0 (TID 1001) java.lang.ClassCastException: org.apache.spark.SparkContext$$anonfun$26 cannot be cast to scala.Function2 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} My code is a relatively simple map-reduce: {code} MapString, Foo aggregateTracker = rdd.map(new MapFunction(list)) .reduce(new ReduceFunction()); {code} Where: - MapFunction is of type FunctionRecord, MapString, Object - ReduceFunction is of type Function2MapString, Foo, MapString, Foo, MapString, Foo - list is just a list of Foo2 Both Foo1 and Foo2 are serializable I've tried this with both the Java and Scala API, lines for each are: org.apache.spark.api.java.JavaRDD.reduce(JavaRDD.scala:32) org.apache.spark.rdd.RDD.reduce(RDD.scala:861) The thing being flagged is always: org.apache.spark.SparkContext$$anonfun$26 (the number doesn't change). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4022) Replace colt dependency (LGPL) with commons-math
[ https://issues.apache.org/jira/browse/SPARK-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1416#comment-1416 ] Sean Owen commented on SPARK-4022: -- Ah right there is Jet too, not just Colt. The LGPL license actually only pertains to a few parts of Colt, in hep.aida.*, which aren't used by Spark. Another solution is just to make sure these classes never become part of the distribution. Colt and Jet themselves don't appear to be LGPL, in the main. Of course, if there was a desire to just stop using Colt+Jet anyway, I'm cool with that too. Replace colt dependency (LGPL) with commons-math Key: SPARK-4022 URL: https://issues.apache.org/jira/browse/SPARK-4022 Project: Spark Issue Type: Bug Components: MLlib, Spark Core Reporter: Patrick Wendell Assignee: Sean Owen Priority: Critical The colt library we use is LGPL-licensed: http://acs.lbl.gov/ACSSoftware/colt/license.html We need to swap this out for commons-math. It is also a very old library that hasn't been updated since 2004. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
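As an example of the kind of swap involved (my illustration; I haven't enumerated the actual call sites), the Colt/Jet random generators such as cern.jet.random.Poisson generally have direct counterparts in the commons-math3 artifact:
{code}
import org.apache.commons.math3.distribution.PoissonDistribution

// Draw a few Poisson(5.0) samples with commons-math3 instead of Colt/Jet.
val poisson = new PoissonDistribution(5.0)
val draws = Seq.fill(10)(poisson.sample())
println(draws.mkString(", "))
{code}
One practical wrinkle is that seeding and the exact sampling algorithms differ between the libraries, which can matter for tests that expect reproducible draws.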
[jira] [Updated] (SPARK-4032) Deprecate YARN alpha support in Spark 1.2
[ https://issues.apache.org/jira/browse/SPARK-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4032: - Summary: Deprecate YARN alpha support in Spark 1.2 (was: Deprecate YARN support in Spark 1.2) Deprecate YARN alpha support in Spark 1.2 - Key: SPARK-4032 URL: https://issues.apache.org/jira/browse/SPARK-4032 Project: Spark Issue Type: Sub-task Components: Spark Core, YARN Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Blocker When someone builds for yarn alpha, we should just display a warning like {code} ***WARNING***: Support for YARN-alpha API's will be removed in Spark 1.3 (see SPARK-3445). {code} We can print a warning when the yarn-alpha profile is used: http://stackoverflow.com/questions/3416573/how-can-i-display-a-message-in-maven -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4034) get java.lang.NoClassDefFoundError: com/google/common/util/concurrent/ThreadFactoryBuilder in idea
[ https://issues.apache.org/jira/browse/SPARK-4034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178219#comment-14178219 ] Sean Owen commented on SPARK-4034: -- You get this when you do what? This is a Guava class. The build compiles and runs correctly, and does for me in IDEA too, so this is probably something wrong with your local setup. get java.lang.NoClassDefFoundError: com/google/common/util/concurrent/ThreadFactoryBuilder in idea Key: SPARK-4034 URL: https://issues.apache.org/jira/browse/SPARK-4034 Project: Spark Issue Type: Bug Reporter: baishuo -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4002) JavaKafkaStreamSuite.testKafkaStream fails on OSX
[ https://issues.apache.org/jira/browse/SPARK-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178239#comment-14178239 ] Sean Owen commented on SPARK-4002: -- This passes for me right now in {{master}}: mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests package mvn test -Dsuites='*KafkaStreamSuite' It also passed with the default settings -- no {{hadoop.version}}, etc. JavaKafkaStreamSuite.testKafkaStream fails on OSX - Key: SPARK-4002 URL: https://issues.apache.org/jira/browse/SPARK-4002 Project: Spark Issue Type: Bug Components: Streaming Environment: Mac OSX 10.9.5. Reporter: Ryan Williams [~sowen] mentioned this on spark-dev [here|http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccamassdjs0fmsdc-k-4orgbhbfz2vvrmm0hfyifeeal-spft...@mail.gmail.com%3E] and I just reproduced it on {{master}} ([7e63bb4|https://github.com/apache/spark/commit/7e63bb49c526c3f872619ae14e4b5273f4c535e9]). The relevant output I get when running {{./dev/run-tests}} is: {code} [info] KafkaStreamSuite: [info] - Kafka input stream [info] Test run started [info] Test org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream started [error] Test org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream failed: junit.framework.AssertionFailedError: expected:3 but was:0 [error] at junit.framework.Assert.fail(Assert.java:50) [error] at junit.framework.Assert.failNotEquals(Assert.java:287) [error] at junit.framework.Assert.assertEquals(Assert.java:67) [error] at junit.framework.Assert.assertEquals(Assert.java:199) [error] at junit.framework.Assert.assertEquals(Assert.java:205) [error] at org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream(JavaKafkaStreamSuite.java:129) [error] ... [info] Test run finished: 1 failed, 0 ignored, 1 total, 19.798s {code} Seems like this test should be {{@Ignore}}'d, or some note about this made on the {{README.md}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3955) Different versions between jackson-mapper-asl and jackson-core-asl
[ https://issues.apache.org/jira/browse/SPARK-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178294#comment-14178294 ] Sean Owen commented on SPARK-3955: -- I think the issue is matching the version used by Hadoop. Your usage of Jackson shouldn't affect what Spark uses, but I imagine you're saying this is a case of dependency leakage? Different versions between jackson-mapper-asl and jackson-core-asl -- Key: SPARK-3955 URL: https://issues.apache.org/jira/browse/SPARK-3955 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.1.0 Reporter: Jongyoul Lee In the parent pom.xml, specified a version of jackson-mapper-asl. This is used by sql/hive/pom.xml. When mvn assembly runs, however, jackson-mapper-asl is not same as jackson-core-asl. This is because other libraries use several versions of jackson, so other version of jackson-core-asl is assembled. Simply, fix this problem if pom.xml has a specific version information of jackson-core-asl. If it's not set, a version 1.9.11 is merged info assembly.jar and we cannot use jackson library properly. {code} [INFO] Including org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8 in the shaded jar. [INFO] Including org.codehaus.jackson:jackson-core-asl:jar:1.9.11 in the shaded jar. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3359) `sbt/sbt unidoc` doesn't work with Java 8
[ https://issues.apache.org/jira/browse/SPARK-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178583#comment-14178583 ] Sean Owen commented on SPARK-3359: -- Sure, but it can be fixed right now too. I tried to figure out how to set plugin properties in SBT and failed, although I'm sure it's not hard at all for someone who knows how it works. `sbt/sbt unidoc` doesn't work with Java 8 - Key: SPARK-3359 URL: https://issues.apache.org/jira/browse/SPARK-3359 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Xiangrui Meng Priority: Minor It seems that Java 8 is stricter on JavaDoc. I got many error messages like {code} [error] /Users/meng/src/spark-mengxr/core/target/java/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.java:2: error: modifier private not allowed here [error] private abstract interface SparkHadoopMapRedUtil { [error] ^ {code} This is minor because we can always use Java 6/7 to generate the doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
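(For reference: compiler-plugin properties are generally passed from sbt through {{scalacOptions}} as {{-P:<plugin>:<key>=<value>}}. The Scala sketch below only illustrates that mechanism for the genjavadoc plugin that unidoc relies on; the artifact version and output path are placeholders, and this is not a tested fix for the Java 8 failures.)
{code}
// Hypothetical sbt settings (e.g. in project/SparkBuild.scala): add the genjavadoc
// compiler plugin and pass it an "out" property via scalacOptions.
lazy val genjavadocSettings: Seq[sbt.Def.Setting[_]] = Seq(
  libraryDependencies += compilerPlugin(
    "com.typesafe.genjavadoc" %% "genjavadoc-plugin" % "0.8" cross CrossVersion.full),
  scalacOptions in Compile += "-P:genjavadoc:out=" + (target.value / "java")
)
{code}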
[jira] [Commented] (SPARK-4021) Issues observed after upgrading Jenkins to JDK7u71
[ https://issues.apache.org/jira/browse/SPARK-4021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179283#comment-14179283 ] Sean Owen commented on SPARK-4021: -- [~shaneknapp] I don't think this can have anything to do with OpenJDK per se. The error clearly indicates that either a) javac 5 or earlier was used, or b) javac was told to target source 1.5 or earlier. OpenJDK Java 7 will compile Spark source entirely correctly. I just tried it on Ubuntu even. There is something else at play here -- some other Java or other configuration was used after the change. It's probably nothing to do with the upgrade per se but something to do with changing or unsetting things like JAVA_HOME. Issues observed after upgrading Jenkins to JDK7u71 -- Key: SPARK-4021 URL: https://issues.apache.org/jira/browse/SPARK-4021 Project: Spark Issue Type: Bug Components: Project Infra Environment: JDK 7u71 Reporter: Patrick Wendell Assignee: shane knapp The following compile failure was observed after adding JDK7u71 to Jenkins. However, this is likely a misconfiguration from Jenkins rather than an issue with Spark (these errors are specific to JDK5, in fact). {code} [error] -- [error] 1. WARNING in /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java (at line 83) [error] private static final Logger logger = Logger.getLogger(JavaKinesisWordCountASL.class); [error] ^^ [error] The field JavaKinesisWordCountASL.logger is never read locally [error] -- [error] 2. WARNING in /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java (at line 151) [error] JavaDStreamString words = unionStreams.flatMap(new FlatMapFunctionbyte[], String() { [error] ^ [error] The serializable class does not declare a static final serialVersionUID field of type long [error] -- [error] 3. ERROR in /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java (at line 153) [error] public IterableString call(byte[] line) { [error] ^ [error] The method call(byte[]) of type new FlatMapFunctionbyte[],String(){} must override a superclass method [error] -- [error] 4. WARNING in /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java (at line 160) [error] new PairFunctionString, String, Integer() { [error] ^^^ [error] The serializable class does not declare a static final serialVersionUID field of type long [error] -- [error] 5. ERROR in /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java (at line 162) [error] public Tuple2String, Integer call(String s) { [error] ^^ [error] The method call(String) of type new PairFunctionString,String,Integer(){} must override a superclass method [error] -- [error] 6. 
WARNING in /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java (at line 165) [error] }).reduceByKey(new Function2Integer, Integer, Integer() { [error] ^^ [error] The serializable class does not declare a static final serialVersionUID field of type long [error] -- [error] 7. ERROR in /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java (at line 167) [error] public Integer call(Integer i1, Integer i2) { [error] [error] The method call(Integer, Integer) of type new Function2Integer,Integer,Integer(){} must override a superclass method [error] -- [error] 7 problems (3 errors, 4 warnings) {code} -- This message was sent by Atlassian JIRA
[jira] [Resolved] (SPARK-4018) RDD.reduce failing with java.lang.ClassCastException: org.apache.spark.SparkContext$$anonfun$26 cannot be cast to scala.Function2
[ https://issues.apache.org/jira/browse/SPARK-4018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4018. -- Resolution: Not a Problem RDD.reduce failing with java.lang.ClassCastException: org.apache.spark.SparkContext$$anonfun$26 cannot be cast to scala.Function2 - Key: SPARK-4018 URL: https://issues.apache.org/jira/browse/SPARK-4018 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1 Reporter: Haithem Turki Priority: Critical Hey all, A simple reduce operation against Spark 1.1.1 is giving me following exception: {code} 14/10/20 16:27:22 ERROR executor.Executor: Exception in task 9.7 in stage 0.0 (TID 1001) java.lang.ClassCastException: org.apache.spark.SparkContext$$anonfun$26 cannot be cast to scala.Function2 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} My code is a relatively simple map-reduce: {code} MapString, Foo aggregateTracker = rdd.map(new MapFunction(list)) .reduce(new ReduceFunction()); {code} Where: - MapFunction is of type FunctionRecord, MapString, Object - ReduceFunction is of type Function2MapString, Foo, MapString, Foo, MapString, Foo - list is just a list of Foo2 Both Foo1 and Foo2 are serializable I've tried this with both the Java and Scala API, lines for each are: org.apache.spark.api.java.JavaRDD.reduce(JavaRDD.scala:32) org.apache.spark.rdd.RDD.reduce(RDD.scala:861) The thing being flagged is always: org.apache.spark.SparkContext$$anonfun$26 (the number doesn't change). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4044) Thriftserver fails to start when JAVA_HOME points to JRE instead of JDK
[ https://issues.apache.org/jira/browse/SPARK-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179620#comment-14179620 ] Sean Owen commented on SPARK-4044: -- How about using {{unzip -l}} to probe the contents of the .jar files? They're just zip files after all. You check the exit status to see if it contained the entry in question -- 0 if it did, non-0 otherwise. I am not sure how this will interact with the check for an invalid JAR file that is also in the script though. Thriftserver fails to start when JAVA_HOME points to JRE instead of JDK --- Key: SPARK-4044 URL: https://issues.apache.org/jira/browse/SPARK-4044 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.1.0, 1.2.0 Reporter: Josh Rosen If {{JAVA_HOME}} points to a JRE instead of a JDK, e.g. {code} JAVA_HOME=/usr/lib/jvm/java-7-oracle/jre/ {code} instead of {code} JAVA_HOME=/usr/lib/jvm/java-7-oracle/ {code} Then start-thriftserver.sh will fail with Datanucleus JAR errors: {code} Caused by: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at javax.jdo.JDOHelper$18.run(JDOHelper.java:2018) at javax.jdo.JDOHelper$18.run(JDOHelper.java:2016) at java.security.AccessController.doPrivileged(Native Method) at javax.jdo.JDOHelper.forName(JDOHelper.java:2015) at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162) {code} The root problem seems to be that {{compute-classpath.sh}} uses {{JAVA_HOME}} to find the path to the {{jar}} command, which isn't present in JRE directories. This leads to silent failures when adding the Datanucleus JARs to the classpath. This same issue presumably affects the command that checks whether Spark was built on Java 7 but run on Java 6. We should probably add some error-handling that checks whether the {{jar}} command is actually present and throws an error otherwise, and also update the documentation to say that `JAVA_HOME` must point to a JDK and not a JRE. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4046) Incorrect Java example on site
[ https://issues.apache.org/jira/browse/SPARK-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4046: - Priority: Minor (was: Critical) Affects Version/s: 1.1.0 Summary: Incorrect Java example on site (was: Incorrect examples on site) Incorrect Java example on site -- Key: SPARK-4046 URL: https://issues.apache.org/jira/browse/SPARK-4046 Project: Spark Issue Type: Bug Components: Documentation, Java API Affects Versions: 1.1.0 Environment: Web Reporter: Ian Babrou Priority: Minor https://spark.apache.org/examples.html The word count example for Java there is incorrect. It should use mapToPair instead of map. The correct example is here: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaWordCount.java -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3359) `sbt/sbt unidoc` doesn't work with Java 8
[ https://issues.apache.org/jira/browse/SPARK-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181503#comment-14181503 ] Sean Owen commented on SPARK-3359: -- I inquired about these with the plugin project: https://github.com/typesafehub/genjavadoc/issues/43 https://github.com/typesafehub/genjavadoc/issues/44 `sbt/sbt unidoc` doesn't work with Java 8 - Key: SPARK-3359 URL: https://issues.apache.org/jira/browse/SPARK-3359 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Xiangrui Meng Priority: Minor It seems that Java 8 is stricter on JavaDoc. I got many error messages like {code} [error] /Users/meng/src/spark-mengxr/core/target/java/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.java:2: error: modifier private not allowed here [error] private abstract interface SparkHadoopMapRedUtil { [error] ^ {code} This is minor because we can always use Java 6/7 to generate the doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4022) Replace colt dependency (LGPL) with commons-math
[ https://issues.apache.org/jira/browse/SPARK-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181524#comment-14181524 ] Sean Owen commented on SPARK-4022: -- I have begun work on this. You can see the base change here: https://github.com/srowen/spark/commits/SPARK-4022 https://github.com/srowen/spark/commit/8246dbd39be7ff162392c59c28dee74a1419e236 There are 4 potential problems, each of which might need some assistance as to how to proceed. *HypothesisTestSuite failure* CC [~dorx] {code} HypothesisTestSuite: ... - chi squared pearson RDD[LabeledPoint] *** FAILED *** org.apache.commons.math3.exception.NotStrictlyPositiveException: shape (0) at org.apache.commons.math3.distribution.GammaDistribution.init(GammaDistribution.java:168) ... at org.apache.spark.mllib.stat.test.ChiSqTest$.chiSquaredMatrix(ChiSqTest.scala:241) at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4.apply(ChiSqTest.scala:134) at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4.apply(ChiSqTest.scala:125) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) {code} The commons-math3 implementation complains that a chi squared distribution is created with 0 degrees of freedom. It looks like that for col 645 in this test, there is just one feature, 0, and two labels. The contingency table should be at least 2x2 but it's 1x2 only, and that's not valid AFAICT. I spent some time staring at this and don't quite know what to make of fixing it. *KMeansClusterSuite failure* CC [~mengxr] {code} KMeansClusterSuite: - task size should be small in both training and prediction *** FAILED *** org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 8, localhost): java.io.InvalidClassException: org.apache.spark.util.random.PoissonSampler; local class incompatible: stream classdesc serialVersionUID = -795011761847245121, local class serialVersionUID = 424924496318419 {code} I understand what it's saying. PoissonSampler did indeed change and its serialized form changed, but, I don't see how two incompatible versions are turning up as a result of one clean build. *RandomForestSuite failure* CC [~josephkb] {code} RandomForestSuite: ... - alternating categorical and continuous features with multiclass labels to test indexing *** FAILED *** java.lang.AssertionError: assertion failed: validateClassifier calculated accuracy 0.75 but required 1.0. at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.mllib.tree.RandomForestSuite$.validateClassifier(RandomForestSuite.scala:227) {code} My guess on this one is that something is sampled differently as a result of this change, and happens to make the decision forest come out differently on this toy data set, and it happens to get 3/4 instead of 4/4 right now. This may be ignorable, meaning, the test was actually a little too strict and optimistic. *Less efficient seeded sampling for series of Poisson variables* CC [~dorx] Colt had a way to seed the RNG, then generate a one-off sample from a Poisson distribution with mean m. commons-math3 lets you seed an instance of a Poisson distribution with mean m, but then not change that mean. To simulate, it's necessary to recreate a Poisson distribution with each successive mean with a deterministic series of seeds. 
See here: https://github.com/srowen/spark/commit/8246dbd39be7ff162392c59c28dee74a1419e236#diff-0544248063499d8688c21f49be0918c8R285 This isn't a problem per se but could be slower. I am not sure if this code can be changed to not require constant reinitialization of the distribution. Replace colt dependency (LGPL) with commons-math Key: SPARK-4022 URL: https://issues.apache.org/jira/browse/SPARK-4022 Project: Spark Issue Type: Bug Components: MLlib, Spark Core Reporter: Patrick Wendell Assignee: Sean Owen Priority: Critical The colt library we use is LGPL-licensed: http://acs.lbl.gov/ACSSoftware/colt/license.html We need to swap this out for commons-math. It is also a very old library that hasn't been updated since 2004. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
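To make the reseeding pattern above concrete, here is a minimal Scala sketch assuming commons-math3 on the classpath; the method name and seed scheme are illustrative only, not the code in the linked commit.
{code}
import org.apache.commons.math3.distribution.PoissonDistribution

// Deterministic one-off Poisson samples for a series of means: commons-math3 fixes the mean
// at construction time, so a new distribution is created and reseeded for each successive mean.
def poissonSamples(means: Seq[Double], baseSeed: Long): Seq[Int] =
  means.zipWithIndex.map { case (mean, i) =>
    val dist = new PoissonDistribution(mean)  // mean must be > 0
    dist.reseedRandomGenerator(baseSeed + i)  // deterministic series of seeds
    dist.sample()
  }
{code}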
[jira] [Commented] (SPARK-4063) Add the ability to send messages to Kafka in the stream
[ https://issues.apache.org/jira/browse/SPARK-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181526#comment-14181526 ] Sean Owen commented on SPARK-4063: -- Is this like a streaming operation that saves an RDD to a Kafka queue? The general ability to write to Kafka is of course accomplishable by just using the Kafka APIs. Add the ability to send messages to Kafka in the stream --- Key: SPARK-4063 URL: https://issues.apache.org/jira/browse/SPARK-4063 Project: Spark Issue Type: New Feature Components: Input/Output Reporter: Helena Edelson Currently you can only receive from Kafka in the stream. This would be adding the ability to publish from the stream as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
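To illustrate the point about the plain Kafka APIs, a rough Scala sketch of publishing from a stream using the Kafka 0.8 producer inside {{foreachRDD}} / {{foreachPartition}}. The broker list, topic and encoder are placeholders, and the producer is created per partition so it is never serialized in a closure.
{code}
import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

// dstream: DStream[String]
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val props = new Properties()
    props.put("metadata.broker.list", "broker1:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    val producer = new Producer[String, String](new ProducerConfig(props))
    partition.foreach(msg => producer.send(new KeyedMessage[String, String]("output-topic", msg)))
    producer.close()
  }
}
{code}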
[jira] [Commented] (SPARK-4066) Make whether maven builds fails on scalastyle violation configurable
[ https://issues.apache.org/jira/browse/SPARK-4066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182658#comment-14182658 ] Sean Owen commented on SPARK-4066: -- Does this work? -Dscalastyle.failOnViolation was already a built-in way to control this, so if it didn't work a) I'm surprised and b) it's not clear this will work either. I suppose I don't think this is something that needs to be configurable. The pain is just that someone writes new code that violates style rules? That has to be fixed anyway, and fixing style issues is not hard. Make whether maven builds fails on scalastyle violation configurable Key: SPARK-4066 URL: https://issues.apache.org/jira/browse/SPARK-4066 Project: Spark Issue Type: Improvement Reporter: Ted Yu Priority: Minor Attachments: spark-4066-v1.txt Here is the thread Koert started: http://search-hadoop.com/m/JW1q5j8z422/scalastyle+annoys+me+a+little+bitsubj=scalastyle+annoys+me+a+little+bit It would be more flexible if whether the maven build fails due to a scalastyle violation were configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4022) Replace colt dependency (LGPL) with commons-math
[ https://issues.apache.org/jira/browse/SPARK-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182781#comment-14182781 ] Sean Owen commented on SPARK-4022: -- [~mengxr] [~josephkb] Great, most of this is resolved now. {{KMeansSuite}} still fails for me, yes, after a clean build. I wonder if the problem is that the commons-math3 code is in a different package in the assembly? Before thinking too hard about it, let me open a PR to see what Jenkins makes of it. I also implemented the different approach to Poisson sampler seeding. It would be good to cap the size of the cache, although, I wonder if that could lead to problems. If a sampler is removed and recreated, it will start generating the same sequence again from the same seed. If it is not seeded, it will be nondeterministic. It looks like {{RandomDataGenerator}} instances are short-lived and applied to a fixed set of mean values, which suggests this won't blow up readily. I admit I just glanced at the usages though. Replace colt dependency (LGPL) with commons-math Key: SPARK-4022 URL: https://issues.apache.org/jira/browse/SPARK-4022 Project: Spark Issue Type: Bug Components: MLlib, Spark Core Reporter: Patrick Wendell Assignee: Sean Owen Priority: Critical The colt library we use is LGPL-licensed: http://acs.lbl.gov/ACSSoftware/colt/license.html We need to swap this out for commons-math. It is also a very old library that hasn't been updated since 2004. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
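As an illustration of the caching idea and its caveat, a Scala sketch of a size-capped cache of samplers keyed by mean; the class name, seed scheme and eviction policy are all hypothetical. Note the problem described above: an entry that is evicted and recreated restarts its sequence from the same seed.
{code}
import scala.collection.mutable
import org.apache.commons.math3.distribution.PoissonDistribution

class CachedPoissonSampler(seed: Long, maxSize: Int = 100) {
  private val cache = mutable.Map.empty[Double, PoissonDistribution]
  def sample(mean: Double): Int = {
    val dist = cache.getOrElseUpdate(mean, {
      if (cache.size >= maxSize) cache.clear()  // crude cap; recreated samplers repeat their sequence
      val d = new PoissonDistribution(mean)
      d.reseedRandomGenerator(seed ^ java.lang.Double.doubleToLongBits(mean))
      d
    })
    dist.sample()
  }
}
{code}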
[jira] [Commented] (SPARK-4066) Make whether maven builds fails on scalastyle violation configurable
[ https://issues.apache.org/jira/browse/SPARK-4066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14183340#comment-14183340 ] Sean Owen commented on SPARK-4066: -- Yeah, it's another flag to add to the build, but it's minor. But how about instead just making scalastyle not run for the package phase at all? I don't feel strongly against this, to be sure. Make whether maven builds fails on scalastyle violation configurable Key: SPARK-4066 URL: https://issues.apache.org/jira/browse/SPARK-4066 Project: Spark Issue Type: Improvement Reporter: Ted Yu Priority: Minor Attachments: spark-4066-v1.txt Here is the thread Koert started: http://search-hadoop.com/m/JW1q5j8z422/scalastyle+annoys+me+a+little+bitsubj=scalastyle+annoys+me+a+little+bit It would be more flexible if whether the maven build fails due to a scalastyle violation were configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4111) [MLlib] Implement regression model evaluation metrics
[ https://issues.apache.org/jira/browse/SPARK-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186496#comment-14186496 ] Sean Owen commented on SPARK-4111: -- Is this more than just MAE / RMSE / R2? It might be handy to have a little utility class for these although they're almost one-liners already. [MLlib] Implement regression model evaluation metrics - Key: SPARK-4111 URL: https://issues.apache.org/jira/browse/SPARK-4111 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: Yanbo Liang Supervised machine learning include classification and regression. There is classification metrics (BinaryClassificationMetrics) in MLlib, we also need regression metrics to evaluate the regression model and tunning parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
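For the record, the metrics in question are roughly these one-liners over an RDD of (prediction, label) pairs; {{predictionAndLabels}} is an assumed {{RDD[(Double, Double)]}} and this is only a sketch, not proposed API.
{code}
import org.apache.spark.SparkContext._  // implicit Double RDD functions in Spark 1.x

val n = predictionAndLabels.count()
val mae = predictionAndLabels.map { case (p, l) => math.abs(p - l) }.sum() / n
val rmse = math.sqrt(predictionAndLabels.map { case (p, l) => (p - l) * (p - l) }.sum() / n)
val meanLabel = predictionAndLabels.map(_._2).sum() / n
val ssTot = predictionAndLabels.map { case (_, l) => (l - meanLabel) * (l - meanLabel) }.sum()
val ssRes = predictionAndLabels.map { case (p, l) => (l - p) * (l - p) }.sum()
val r2 = 1.0 - ssRes / ssTot
{code}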
[jira] [Commented] (SPARK-4121) Master build failures after shading commons-math3
[ https://issues.apache.org/jira/browse/SPARK-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14187475#comment-14187475 ] Sean Owen commented on SPARK-4121: -- Yeah I was seeing this locally, but not on the Jenkins test build, so chalked it up to weirdness in my build. I think the answer may indeed be to do the relocating in core/mllib itself. I'll get on that. Master build failures after shading commons-math3 - Key: SPARK-4121 URL: https://issues.apache.org/jira/browse/SPARK-4121 Project: Spark Issue Type: Bug Components: Build, MLlib, Spark Core Affects Versions: 1.2.0 Reporter: Xiangrui Meng Priority: Blocker The Spark master Maven build kept failing after we replace colt with commons-math3 and shade the latter: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/ The error message is: {code} KMeansClusterSuite: Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark assembly has been built with Hive, including Datanucleus jars on classpath - task size should be small in both training and prediction *** FAILED *** org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 9, localhost): java.io.InvalidClassException: org.apache.spark.util.random.PoissonSampler; local class incompatible: stream classdesc serialVersionUID = -795011761847245121, local class serialVersionUID = 424924496318419 java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} This test passed in local sbt build. So the issue should be caused by shading. Maybe there are two versions of commons-math3 (hadoop depends on it), or MLlib doesn't use the shaded version at compile. [~srowen] Could you take a look? Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4132) Spark uses incompatible HDFS API
[ https://issues.apache.org/jira/browse/SPARK-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4132. -- Resolution: Duplicate I'm all but certain you're describing the same thing as SPARK-4078 Spark uses incompatible HDFS API Key: SPARK-4132 URL: https://issues.apache.org/jira/browse/SPARK-4132 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: Spark1.1.0 on Hadoop1.2.1 CentOS 6.3 64bit Reporter: kuromatsu nobuyuki Priority: Minor When I enable event logging and set it to output to HDFS, initialization fails with 'java.lang.ClassNotFoundException' (see trace below). I found that an API incompatibility in org.apache.hadoop.fs.permission.FsPermission between Hadoop 1.0.4 and Hadoop 1.1.0 (and above) causes this error (org.apache.hadoop.fs.permission.FsPermission$2 is used in 1.0.4 but doesn't exist in my 1.2.1 environment). I think that the Spark jar file pre-built for Hadoop1.X should be built on Hadoop Stable version(Hadoop 1.2.1). 2014-10-24 10:43:22,893 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 9000: readAndProcess threw exception java.lang.RuntimeException: readObject can't find class org.apache.hadoop.fs.permission.FsPermission$2. Count of bytes read: 0 java.lang.RuntimeException: readObject can't find class org.apache.hadoop.fs.permission.FsPermission$2 at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:233) at org.apache.hadoop.ipc.RPC$Invocation.readFields(RPC.java:106) at org.apache.hadoop.ipc.Server$Connection.processData(Server.java:1347) at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1326) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1226) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:577) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:384) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.permission.FsPermission$2 at java.net.URLClassLoader$1.run(URLClassLoader.java:217) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:205) at java.lang.ClassLoader.loadClass(ClassLoader.java:323) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294) at java.lang.ClassLoader.loadClass(ClassLoader.java:268) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810) at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:231) ... 9 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-683) Spark 0.7 with Hadoop 1.0 does not work with current AMI's HDFS installation
[ https://issues.apache.org/jira/browse/SPARK-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14188160#comment-14188160 ] Sean Owen commented on SPARK-683: - PS I think this also turns out to be the same as SPARK-4078 Spark 0.7 with Hadoop 1.0 does not work with current AMI's HDFS installation Key: SPARK-683 URL: https://issues.apache.org/jira/browse/SPARK-683 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 0.7.0 Reporter: Tathagata Das A simple saveAsObjectFile() leads to the following error. org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NoSuchMethodException: org.apache.hadoop.hdfs.protocol.ClientProtocol.create(java.lang.String, org.apache.hadoop.fs.permission.FsPermission, java.lang.String, boolean, boolean, short, long) at java.lang.Class.getMethod(Class.java:1622) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4147) Remove log4j dependency
[ https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189739#comment-14189739 ] Sean Owen commented on SPARK-4147: -- Yes, slf4j does not let you actually control logging levels. You have to call into a specific logging implementation for this, and that's why log4j is used directly. You can still re-route log4j to slf4j, and slf4j to your own logger. I think this is on purpose, so I don't necessarily think this should change. Remove log4j dependency --- Key: SPARK-4147 URL: https://issues.apache.org/jira/browse/SPARK-4147 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 1.1.0 Reporter: Tobias Pfeiffer spark-core has a hard dependency on log4j, which shouldn't be necessary since slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my sbt file. Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. However, removing the log4j dependency fails because in https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121 a static method of org.apache.log4j.LogManager is accessed *even if* log4j is not in use. I guess removing all dependencies on log4j may be a bigger task, but it would be a great help if the access to LogManager would be done only if log4j use was detected before. (This is a 2-line change.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
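A Scala sketch of the guard the reporter describes (only touching log4j when it is actually the active slf4j backend); this illustrates the idea and is not the exact change that would go into Logging.scala.
{code}
import org.slf4j.LoggerFactory

// When the slf4j-log4j12 binding is active, the logger factory is the log4j binding's class.
val usingLog4j = LoggerFactory.getILoggerFactory.getClass.getName == "org.slf4j.impl.Log4jLoggerFactory"
if (usingLog4j) {
  // Only in this branch is it safe to call into log4j directly, e.g. to adjust default levels.
  val rootLogger = org.apache.log4j.LogManager.getRootLogger
  println("log4j root logger level: " + rootLogger.getLevel)
}
{code}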
[jira] [Commented] (SPARK-4121) Master build failures after shading commons-math3
[ https://issues.apache.org/jira/browse/SPARK-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14190708#comment-14190708 ] Sean Owen commented on SPARK-4121: -- Sorry I have been traveling without internet. When I tried revising the shading last night, tests hung. Probably due to network weirdness. My concern was that by shading just 2 modules you have to watch for other modules that accidentally start using Math3. I think a less drastic solution like modulating the version with hadoop profile sounds fine in principle. I'd have to look up what version goes with what. I think we are not using any very new methods. Sorry about that and please proceed as you see fit though I will have another look tonight. Master build failures after shading commons-math3 - Key: SPARK-4121 URL: https://issues.apache.org/jira/browse/SPARK-4121 Project: Spark Issue Type: Bug Components: Build, MLlib, Spark Core Affects Versions: 1.2.0 Reporter: Xiangrui Meng Priority: Blocker The Spark master Maven build kept failing after we replace colt with commons-math3 and shade the latter: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/ The error message is: {code} KMeansClusterSuite: Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark assembly has been built with Hive, including Datanucleus jars on classpath - task size should be small in both training and prediction *** FAILED *** org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 9, localhost): java.io.InvalidClassException: org.apache.spark.util.random.PoissonSampler; local class incompatible: stream classdesc serialVersionUID = -795011761847245121, local class serialVersionUID = 424924496318419 java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} This test passed in local sbt build. So the issue should be caused by shading. 
Maybe there are two versions of commons-math3 (hadoop depends on it), or MLlib doesn't use the shaded version at compile. [~srowen] Could you take a look? Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4162) Make scripts symlinkable
[ https://issues.apache.org/jira/browse/SPARK-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4162. -- Resolution: Duplicate Duplicate of https://issues.apache.org/jira/browse/SPARK-3482 and https://issues.apache.org/jira/browse/SPARK-2960 Have a look at the PR for 3482 and suggest changes. This has come up several times so would be good to get it fixed. Make scripts symlinkable - Key: SPARK-4162 URL: https://issues.apache.org/jira/browse/SPARK-4162 Project: Spark Issue Type: Improvement Components: Deploy, EC2, Spark Shell Affects Versions: 1.1.0 Environment: Mac, linux Reporter: Shay Seng Scripts are not symlink-able because they all use: FWDIR=$(cd `dirname $0`/..; pwd) to detect the parent Spark dir, which doesn't take into account symlinks. Instead replace the above line with: SOURCE=$0; SCRIPT=`basename $SOURCE`; while [ -h $SOURCE ]; do SCRIPT=`basename $SOURCE`; LOOKUP=`ls -ld $SOURCE`; TARGET=`expr $LOOKUP : '.*-> \(.*\)$'`; if expr ${TARGET:-.}/ : '/.*/$' > /dev/null; then SOURCE=${TARGET:-.}; else SOURCE=`dirname $SOURCE`/${TARGET:-.}; fi; done; FWDIR=$(cd `dirname $SOURCE`/..; pwd) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4170) Closure problems when running Scala app that extends App
Sean Owen created SPARK-4170: Summary: Closure problems when running Scala app that extends App Key: SPARK-4170 URL: https://issues.apache.org/jira/browse/SPARK-4170 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Sean Owen Priority: Minor Michael Albert noted this problem on the mailing list (http://apache-spark-user-list.1001560.n3.nabble.com/BUG-when-running-as-quot-extends-App-quot-closures-don-t-capture-variables-td17675.html): {code} object DemoBug extends App { val conf = new SparkConf() val sc = new SparkContext(conf) val rdd = sc.parallelize(List("A", "B", "C", "D")) val str1 = "A" val rslt1 = rdd.filter(x => { x != "A" }).count val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2) } {code} This produces the output: {code} DemoBug: rslt1 = 3 rslt2 = 0 {code} If instead there is a proper main(), it works as expected. I also this week noticed that in a program which extends App, some values were inexplicably null in a closure. When changing to use main(), it was fine. I assume there is a problem with variables not being added to the closure when main() doesn't appear in the standard way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
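For contrast, a minimal Scala sketch of the {{main()}}-based form that the description says behaves correctly; the object name is illustrative.
{code}
import org.apache.spark.{SparkConf, SparkContext}

object DemoFixed {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf())
    val rdd = sc.parallelize(List("A", "B", "C", "D"))
    val str1 = "A"
    // With a proper main(), str1 is captured correctly and the count is 3 as expected
    val rslt2 = rdd.filter(x => str1 != null && x != "A").count()
    println("DemoFixed: rslt2 = " + rslt2)
  }
}
{code}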
[jira] [Created] (SPARK-4196) Streaming + checkpointing yields NotSerializableException for Hadoop Configuration from saveAsNewAPIHadoopFiles ?
Sean Owen created SPARK-4196: Summary: Streaming + checkpointing yields NotSerializableException for Hadoop Configuration from saveAsNewAPIHadoopFiles ? Key: SPARK-4196 URL: https://issues.apache.org/jira/browse/SPARK-4196 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: Sean Owen I am reasonably sure there is some issue here in Streaming and that I'm not missing something basic, but not 100%. I went ahead and posted it as a JIRA to track, since it's come up a few times before without resolution, and right now I can't get checkpointing to work at all. When Spark Streaming checkpointing is enabled, I see a NotSerializableException thrown for a Hadoop Configuration object, and it seems like it is not one from my user code. Before I post my particular instance see http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1408135046777-12202.p...@n3.nabble.com%3E for another occurrence. I was also on customer site last week debugging an identical issue with checkpointing in a Scala-based program and they also could not enable checkpointing without hitting exactly this error. The essence of my code is: {code} final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf); JavaStreamingContextFactory streamingContextFactory = new JavaStreamingContextFactory() { @Override public JavaStreamingContext create() { return new JavaStreamingContext(sparkContext, new Duration(batchDurationMS)); } }; streamingContext = JavaStreamingContext.getOrCreate( checkpointDirString, sparkContext.hadoopConfiguration(), streamingContextFactory, false); streamingContext.checkpoint(checkpointDirString); {code} It yields: {code} 2014-10-31 14:29:00,211 ERROR OneForOneStrategy:66 org.apache.hadoop.conf.Configuration - field (class org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9, name: conf$2, type: class org.apache.hadoop.conf.Configuration) - object (class org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9, function2) - field (class org.apache.spark.streaming.dstream.ForEachDStream, name: org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc, type: interface scala.Function2) - object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream@cb8016a) ... {code} This looks like it's due to PairRDDFunctions, as this saveFunc seems to be org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9 : {code} def saveAsNewAPIHadoopFiles( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ : NewOutputFormat[_, _]], conf: Configuration = new Configuration ) { val saveFunc = (rdd: RDD[(K, V)], time: Time) = { val file = rddToFileName(prefix, suffix, time) rdd.saveAsNewAPIHadoopFile(file, keyClass, valueClass, outputFormatClass, conf) } self.foreachRDD(saveFunc) } {code} Is that not a problem? but then I don't know how it would ever work in Spark. But then again I don't see why this is an issue and only when checkpointing is enabled. Long-shot, but I wonder if it is related to closure issues like https://issues.apache.org/jira/browse/SPARK-1866 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
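While this is investigated, one way to sidestep the capture, sketched below in Scala, is to build the Configuration inside the {{foreachRDD}} function rather than passing one in. This is only an illustration (the key/value types and output path are placeholders), and it drops any settings carried on the driver's configuration.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// stream: DStream[(Text, Text)]
stream.foreachRDD { (rdd, time) =>
  val conf = new Configuration()  // constructed inside the closure, so no driver-side conf is captured
  val file = "/out/prefix-" + time.milliseconds + ".suffix"
  rdd.saveAsNewAPIHadoopFile(file, classOf[Text], classOf[Text],
    classOf[TextOutputFormat[Text, Text]], conf)
}
{code}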
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194315#comment-14194315 ] Sean Owen commented on SPARK-1406: -- I put some comments on the PR. Thanks for starting on this. I think PMML interoperability is indeed helpful. So, one big issue here is that MLlib does not at the moment have any notion of a schema. PMML does, and this is vital to actually using the model elsewhere. You have to document what the variables are so they can be matched up with the same variables in another tool. So it's not possible now to do anything but make a model with field_1, field_2, ... This calls into question whether PMML can be meaningfully exported at this point from MLlib? Maybe it will have to wait until other PRs go in that start to add schema. I also thought it would be a little better to separate the representation of a model, from utility methods to write the model to things like files. The latter can be at least separated out of the type hierarchy. I'm also wondering how much value it adds to design for non-PMML export at this stage. (Finally I have some code lying around here that will translate the MLlib logistic regression model to PMML. I can put that in the pot at a suitable time.) PMML model evaluation support via MLib -- Key: SPARK-1406 URL: https://issues.apache.org/jira/browse/SPARK-1406 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Thomas Darimont Attachments: SPARK-1406.pdf, kmeans.xml It would be useful if spark would provide support the evaluation of PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow to use analytical models that were created with a statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which would perform the actual model evaluation for a given input tuple. The PMML model would then just contain the parameterization of an analytical model. Other projects like JPMML-Evaluator do a similar thing. https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
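To illustrate the suggested separation of the model representation from the export utilities, a rough Scala sketch; all names here are hypothetical, not proposed API.
{code}
// A model only knows how to render itself as PMML text; writing to files or streams
// lives in a separate helper rather than in the model type hierarchy.
trait PMMLExportable {
  def toPMML(): String
}

object PMMLUtils {
  def save(model: PMMLExportable, path: String): Unit = {
    val writer = new java.io.PrintWriter(path)
    try writer.write(model.toPMML()) finally writer.close()
  }
}
{code}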
[jira] [Commented] (SPARK-4206) BlockManager warnings in local mode: Block $blockId already exists on this machine; not re-adding it
[ https://issues.apache.org/jira/browse/SPARK-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194550#comment-14194550 ] Sean Owen commented on SPARK-4206: -- I think there was a discussion about this and the consensus was that these aren't anything to worry about and can be info-level messages? BlockManager warnings in local mode: Block $blockId already exists on this machine; not re-adding it - Key: SPARK-4206 URL: https://issues.apache.org/jira/browse/SPARK-4206 Project: Spark Issue Type: Bug Reporter: Imran Rashid Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2432) Apriori algorithm for frequent itemset mining
[ https://issues.apache.org/jira/browse/SPARK-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2432. -- Resolution: Duplicate Resolving as duplicate of the later issue with more discussion. Apriori algorithm for frequent itemset mining - Key: SPARK-2432 URL: https://issues.apache.org/jira/browse/SPARK-2432 Project: Spark Issue Type: Improvement Components: MLlib Reporter: lukovnikov A parallel implementation of the apriori algorithm. Apriori is a well-known and simple algorithm that finds frequent itemsets and lends itself perfectly for a parallel implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3954) source code optimization
[ https://issues.apache.org/jira/browse/SPARK-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195956#comment-14195956 ] Sean Owen commented on SPARK-3954: -- "source code optimization" is not a good JIRA title. Please don't change it back again. source code optimization Key: SPARK-3954 URL: https://issues.apache.org/jira/browse/SPARK-3954 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0, 1.1.0 Reporter: 宿荣全 In converting files to RDDs there are 3 loops over the files sequence in the Spark source: 1. files.map(...) 2. files.zip(fileRDDs) 3. .foreach over the zipped pairs. Change the 3 traversals into 1 traversal. Spark source code: private def filesToRDD(files: Seq[String]): RDD[(K, V)] = { val fileRDDs = files.map(file => context.sparkContext.newAPIHadoopFile[K, V, F](file)) files.zip(fileRDDs).foreach { case (file, rdd) => { if (rdd.partitions.size == 0) { logError("File " + file + " has no data in it. Spark Streaming can only ingest " + "files that have been \"moved\" to the directory assigned to the file stream. " + "Refer to the streaming programming guide for more details.") } }} new UnionRDD(context.sparkContext, fileRDDs) } // --- modified code: private def filesToRDD(files: Seq[String]): RDD[(K, V)] = { val fileRDDs = for (file <- files; rdd = context.sparkContext.newAPIHadoopFile[K, V, F](file)) yield { if (rdd.partitions.size == 0) { logError("File " + file + " has no data in it. Spark Streaming can only ingest " + "files that have been \"moved\" to the directory assigned to the file stream. " + "Refer to the streaming programming guide for more details.") } rdd } new UnionRDD(context.sparkContext, fileRDDs) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3954) Optimization to FileInputDStream
[ https://issues.apache.org/jira/browse/SPARK-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3954: - Summary: Optimization to FileInputDStream (was: source code optimization) Optimization to FileInputDStream Key: SPARK-3954 URL: https://issues.apache.org/jira/browse/SPARK-3954 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0, 1.1.0 Reporter: 宿荣全 In converting files to RDDs there are 3 loops over the files sequence in the Spark source: 1. files.map(...) 2. files.zip(fileRDDs) 3. .foreach over the zipped pairs. Change the 3 traversals into 1 traversal. Spark source code: private def filesToRDD(files: Seq[String]): RDD[(K, V)] = { val fileRDDs = files.map(file => context.sparkContext.newAPIHadoopFile[K, V, F](file)) files.zip(fileRDDs).foreach { case (file, rdd) => { if (rdd.partitions.size == 0) { logError("File " + file + " has no data in it. Spark Streaming can only ingest " + "files that have been \"moved\" to the directory assigned to the file stream. " + "Refer to the streaming programming guide for more details.") } }} new UnionRDD(context.sparkContext, fileRDDs) } // --- modified code: private def filesToRDD(files: Seq[String]): RDD[(K, V)] = { val fileRDDs = for (file <- files; rdd = context.sparkContext.newAPIHadoopFile[K, V, F](file)) yield { if (rdd.partitions.size == 0) { logError("File " + file + " has no data in it. Spark Streaming can only ingest " + "files that have been \"moved\" to the directory assigned to the file stream. " + "Refer to the streaming programming guide for more details.") } rdd } new UnionRDD(context.sparkContext, fileRDDs) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4220) Spark New Feature
[ https://issues.apache.org/jira/browse/SPARK-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4220. -- Resolution: Invalid Given two empty JIRAs were opened, I assume these were accidental. Spark New Feature - Key: SPARK-4220 URL: https://issues.apache.org/jira/browse/SPARK-4220 Project: Spark Issue Type: New Feature Reporter: Tao Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4219) Spark New Feature
[ https://issues.apache.org/jira/browse/SPARK-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4219. -- Resolution: Invalid Given two empty JIRAs were opened, I assume these were accidental. Spark New Feature - Key: SPARK-4219 URL: https://issues.apache.org/jira/browse/SPARK-4219 Project: Spark Issue Type: New Feature Reporter: Tao Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4196) Streaming + checkpointing yields NotSerializableException for Hadoop Configuration from saveAsNewAPIHadoopFiles ?
[ https://issues.apache.org/jira/browse/SPARK-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14196762#comment-14196762 ] Sean Owen commented on SPARK-4196: -- Same problem I'm afraid. The serialization error suggests it's the fact that the configuration -- whatever its source in the caller -- is serialized in a call to foreachRDD in saveAsNewAPIHadoopFiles. Streaming + checkpointing yields NotSerializableException for Hadoop Configuration from saveAsNewAPIHadoopFiles ? - Key: SPARK-4196 URL: https://issues.apache.org/jira/browse/SPARK-4196 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: Sean Owen I am reasonably sure there is some issue here in Streaming and that I'm not missing something basic, but not 100%. I went ahead and posted it as a JIRA to track, since it's come up a few times before without resolution, and right now I can't get checkpointing to work at all. When Spark Streaming checkpointing is enabled, I see a NotSerializableException thrown for a Hadoop Configuration object, and it seems like it is not one from my user code. Before I post my particular instance see http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1408135046777-12202.p...@n3.nabble.com%3E for another occurrence. I was also on customer site last week debugging an identical issue with checkpointing in a Scala-based program and they also could not enable checkpointing without hitting exactly this error. The essence of my code is: {code} final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf); JavaStreamingContextFactory streamingContextFactory = new JavaStreamingContextFactory() { @Override public JavaStreamingContext create() { return new JavaStreamingContext(sparkContext, new Duration(batchDurationMS)); } }; streamingContext = JavaStreamingContext.getOrCreate( checkpointDirString, sparkContext.hadoopConfiguration(), streamingContextFactory, false); streamingContext.checkpoint(checkpointDirString); {code} It yields: {code} 2014-10-31 14:29:00,211 ERROR OneForOneStrategy:66 org.apache.hadoop.conf.Configuration - field (class org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9, name: conf$2, type: class org.apache.hadoop.conf.Configuration) - object (class org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9, function2) - field (class org.apache.spark.streaming.dstream.ForEachDStream, name: org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc, type: interface scala.Function2) - object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream@cb8016a) ... {code} This looks like it's due to PairRDDFunctions, as this saveFunc seems to be org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9 : {code} def saveAsNewAPIHadoopFiles( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ : NewOutputFormat[_, _]], conf: Configuration = new Configuration ) { val saveFunc = (rdd: RDD[(K, V)], time: Time) = { val file = rddToFileName(prefix, suffix, time) rdd.saveAsNewAPIHadoopFile(file, keyClass, valueClass, outputFormatClass, conf) } self.foreachRDD(saveFunc) } {code} Is that not a problem? but then I don't know how it would ever work in Spark. But then again I don't see why this is an issue and only when checkpointing is enabled. 
Long-shot, but I wonder if it is related to closure issues like https://issues.apache.org/jira/browse/SPARK-1866 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4237) add Manifest File for Maven building
[ https://issues.apache.org/jira/browse/SPARK-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14197698#comment-14197698 ] Sean Owen commented on SPARK-4237: -- How does the PR address this? the manifest file is already built and included, and contains manifest entries from dependencies. Is this a problem? add Manifest File for Maven building Key: SPARK-4237 URL: https://issues.apache.org/jira/browse/SPARK-4237 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.2.0 Running with spark sql jdbc/odbc, the output will be JackydeMacBook-Pro:spark1 jackylee$ bin/beeline Spark assembly has been built with Hive, including Datanucleus jars on classpath Beeline version ??? by Apache Hive we should add Manifest File for Maven building -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14198514#comment-14198514 ] Sean Owen commented on SPARK-1297: -- Ted, are you updating a pull request? Patches aren't used in this project. Upgrade HBase dependency to 0.98.0 -- Key: SPARK-1297 URL: https://issues.apache.org/jira/browse/SPARK-1297 Project: Spark Issue Type: Task Reporter: Ted Yu Assignee: Ted Yu Priority: Minor Attachments: pom.xml, spark-1297-v2.txt, spark-1297-v4.txt, spark-1297-v5.txt, spark-1297-v6.txt, spark-1297-v7.txt HBase 0.94.6 was released 11 months ago. Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4271) jetty Server can't tryport+1
[ https://issues.apache.org/jira/browse/SPARK-4271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4271. -- Resolution: Duplicate Well, here's a good example. I'm sure this is a duplicate of https://issues.apache.org/jira/browse/SPARK-4169, which is solved better in its PR, and that PR is still awaiting review/commit. jetty Server can't tryport+1 Key: SPARK-4271 URL: https://issues.apache.org/jira/browse/SPARK-4271 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: Operating system language is Chinese. Reporter: honestold3 If the operating system language is not English, the message of the exception that occurs does not contain the "BindException" text that org.apache.spark.util.Utils.isBindCollision checks for, so the Jetty server can't try port+1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
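(As I understand the approach in SPARK-4169's PR, the robust fix is to check the exception type rather than match localized message text; roughly, a sketch of that idea:)
{code}
// Rough sketch of a locale-independent check (not necessarily the exact patch):
// walk the cause chain and look for java.net.BindException by type instead of
// searching for English text in the exception message.
import java.net.BindException

def isBindCollision(exception: Throwable): Boolean = exception match {
  case null => false
  case _: BindException => true
  case e => isBindCollision(e.getCause)
}
{code}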
[jira] [Commented] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200396#comment-14200396 ] Sean Owen commented on SPARK-4231: -- So this method basically computes where each test item would rank if you asked for a list of recommendations that ranks every single item. It's not necessarily efficient, but is simple. The reason I did it that way was to avoid recreating a lot of the recommender ranking logic. I don't think one has to define MAP this way -- I effectively averaged over all k up to the # of items. Yes I found the straightforward definition hard to implement at scale. I ended up opting to compute an approximation of AUC for recommender eval in this next version I'm working on: https://github.com/OryxProject/oryx/blob/master/oryx-ml-mllib/src/main/java/com/cloudera/oryx/ml/mllib/als/AUC.java#L106 Sorry for the hard-to-read Java 7; going to redo this in Java 8 soon. Basically you're just sampling random relevant/not-relevant pairs and comparing their scores. You might consider that. I dunno if it's worth bothering with a toy implementation in the examples. The example is really there just to show Spark, not ALS. Add RankingMetrics to examples.MovieLensALS --- Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Debasish Das Fix For: 1.2.0 Original Estimate: 24h Remaining Estimate: 24h examples.MovieLensALS computes RMSE for movielens dataset but after addition of RankingMetrics and enhancements to ALS, it is critical to look at not only the RMSE but also measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and also added a flag that takes an input whether user/product recommendation is being validated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
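(The pair-sampling idea, in a rough Scala sketch rather than that Java 7 code; score, relevantIds and notRelevantIds are assumed placeholder inputs for a single user, and ties count as half.)
{code}
// Rough sketch of the sampled-AUC estimate described above: the probability
// that a randomly chosen relevant item scores higher than a randomly chosen
// non-relevant one, estimated from numPairs random pairs.
import scala.util.Random

def approxAUC(score: Int => Double,
              relevantIds: IndexedSeq[Int],
              notRelevantIds: IndexedSeq[Int],
              numPairs: Int = 10000): Double = {
  val rng = new Random()
  var hits = 0.0
  for (_ <- 1 to numPairs) {
    val pos = score(relevantIds(rng.nextInt(relevantIds.size)))
    val neg = score(notRelevantIds(rng.nextInt(notRelevantIds.size)))
    if (pos > neg) hits += 1.0        // correctly ordered pair
    else if (pos == neg) hits += 0.5  // count ties as half
  }
  hits / numPairs
}
{code}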
[jira] [Commented] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200470#comment-14200470 ] Sean Owen commented on SPARK-4231: -- Yes I'm mostly questioning implementing this in examples. The definition in RankingMetrics looks like the usual one to me -- average from 1 to min(# recs, # relevant items). You could say the version you found above is 'extended' to look into the long tail (# recs = # items), although the long tail doesn't affect MAP much. Same definition, different limit. precision@k does not have the same question since there is one k value, not lots. AUC may not help you if you're comparing to other things for which you don't have AUC. It was a side comment mostly. (Anyway there is already an AUC implementation here which I am trying to see if I can use.) Add RankingMetrics to examples.MovieLensALS --- Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Debasish Das Fix For: 1.2.0 Original Estimate: 24h Remaining Estimate: 24h examples.MovieLensALS computes RMSE for movielens dataset but after addition of RankingMetrics and enhancements to ALS, it is critical to look at not only the RMSE but also measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and also added a flag that takes an input whether user/product recommendation is being validated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
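(To make the "same definition, different limit" point concrete, here is a small sketch of average precision truncated at min(# recs, # relevant items); predicted and relevant are assumed inputs for one user, and this is one common formulation rather than necessarily exactly what RankingMetrics implements.)
{code}
// Sketch of average precision with the limit min(# recs, # relevant items).
// `predicted` is the ranked list of recommended item IDs; `relevant` is the
// ground-truth set of relevant item IDs.
def averagePrecision(predicted: Seq[Int], relevant: Set[Int]): Double = {
  val limit = math.min(predicted.size, relevant.size)
  var hits = 0
  var sumPrecision = 0.0
  predicted.take(limit).zipWithIndex.foreach { case (id, i) =>
    if (relevant.contains(id)) {
      hits += 1
      sumPrecision += hits.toDouble / (i + 1)  // precision at rank i + 1
    }
  }
  if (limit == 0) 0.0 else sumPrecision / limit
}
{code}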
[jira] [Commented] (SPARK-4276) Spark streaming requires at least two working thread
[ https://issues.apache.org/jira/browse/SPARK-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200539#comment-14200539 ] Sean Owen commented on SPARK-4276: -- This is basically the same concern already addressed by https://issues.apache.org/jira/browse/SPARK-4040, no? This code could set a master of, say, local[2], but my understanding was that the examples don't set a master; it is supplied by spark-submit. Then again, SPARK-4040 changed the doc example to set a local[2] master. Hm. Spark streaming requires at least two working thread Key: SPARK-4276 URL: https://issues.apache.org/jira/browse/SPARK-4276 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: varun sharma Fix For: 1.1.0 Spark Streaming requires at least two working threads, but the example in spark/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala has
// Create the context with a 1 second batch size
val sparkConf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(1))
which creates only 1 thread. It should have at least 2 threads: http://spark.apache.org/docs/latest/streaming-programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
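(For reference, a minimal sketch of the form the streaming guide recommends for local testing; outside of local testing the master would normally be left to spark-submit rather than hard-coded like this.)
{code}
// Minimal sketch: use at least 2 local threads so the receiver and the batch
// processing don't starve each other when running locally.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(1))
{code}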
[jira] [Commented] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.
[ https://issues.apache.org/jira/browse/SPARK-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201827#comment-14201827 ] Sean Owen commented on SPARK-4289: -- This is a Hadoop issue, right? I don't know if Spark can address this directly. I suppose you could work around this with :silent in the shell. Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance. -- Key: SPARK-4289 URL: https://issues.apache.org/jira/browse/SPARK-4289 Project: Spark Issue Type: Bug Reporter: Corey J. Nolet This one is easy to reproduce.
{code}
val job = new Job(sc.hadoopConfiguration)
{code}
I'm not sure what the solution would be offhand, as it's happening when the shell is calling toString() on the instance of Job. The problem is, because of the failure, the instance is never actually assigned to the job val.
java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
at org.apache.hadoop.mapreduce.Job.toString(Job.java:452)
at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
at .<init>(console:10)
at .<clinit>(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - 
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
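(A rough sketch of the :silent workaround mentioned above, as a hypothetical spark-shell session; the REPL's automatic result printing is what calls toString on the Job while it is still in the DEFINE state, so suppressing that printing avoids the exception.)
{code}
// Hypothetical spark-shell session (not verbatim output). :silent toggles the
// REPL's automatic printing of results, so Job.toString is never invoked.
scala> :silent
scala> val job = new org.apache.hadoop.mapreduce.Job(sc.hadoopConfiguration)
scala> :silent   // toggle result printing back on afterwards
{code}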
[jira] [Updated] (SPARK-4288) Add Sparse Autoencoder algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4288: - Description: Are you proposing an implementation? Is it related to the neural network JIRA? Target Version/s: (was: 1.3.0) Issue Type: Wish (was: Bug) Add Sparse Autoencoder algorithm to MLlib -- Key: SPARK-4288 URL: https://issues.apache.org/jira/browse/SPARK-4288 Project: Spark Issue Type: Wish Components: MLlib Reporter: Guoqiang Li Labels: features Are you proposing an implementation? Is it related to the neural network JIRA? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4283) Spark source code does not correctly import into eclipse
[ https://issues.apache.org/jira/browse/SPARK-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201853#comment-14201853 ] Sean Owen commented on SPARK-4283: -- This is really an Eclipse problem. I don't personally think it's worth the extra weight in the build for this. (Use pull requests, not patches on JIRAs, in Spark.) Spark source code does not correctly import into eclipse Key: SPARK-4283 URL: https://issues.apache.org/jira/browse/SPARK-4283 Project: Spark Issue Type: Bug Components: Build Reporter: Yang Yang Priority: Minor Attachments: spark_eclipse.diff When I import the Spark source into Eclipse, either by running mvn eclipse:eclipse and then importing existing general projects, or by importing existing Maven projects, it does not recognize the project as a Scala project. I am adding a new plugin so the import works. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org