RE: Intermediate stage will be cached automatically?

2015-06-17 Thread Mark Tse
I think 
https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence 
might shed some light on the behaviour you’re seeing.

Mark

From: canan chen [mailto:ccn...@gmail.com]
Sent: June-17-15 5:57 AM
To: spark users
Subject: Intermediate stage will be cached automatically?

Here's a simple Spark example in which I call RDD#count twice. The first call 
invokes 2 stages, but the second needs only 1 stage. It seems the first stage 
is cached. Is that true? Is there a flag I can use to control whether the 
intermediate stage is cached?

val data = sc.parallelize(1 to 10, 2).map(e => (e % 2, 2)).reduceByKey(_ + _, 2)
println(data.count())
println(data.count())
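
For reference, caching is opt-in: below is a minimal sketch of the same example 
with an explicit persist() call, which is what the persistence section linked 
above covers (it assumes the usual SparkContext sc, as in the snippet):

import org.apache.spark.storage.StorageLevel

// Opt-in caching of the reduced RDD; cache() is shorthand for
// persist(StorageLevel.MEMORY_ONLY). Without this, the second count can
// still skip the first stage because the map stage's shuffle output is
// reused, but the RDD itself is not cached.
val data = sc.parallelize(1 to 10, 2)
  .map(e => (e % 2, 2))
  .reduceByKey(_ + _, 2)
  .persist(StorageLevel.MEMORY_ONLY)
println(data.count())  // first action: computes and caches the partitions
println(data.count())  // second action: served from the cached partitions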


Unit Testing Spark Transformations/Actions

2015-06-16 Thread Mark Tse
Hi there,

I am looking to use Mockito to mock out some functionality while unit testing a 
Spark application.

I currently have code that happily runs on a cluster, but fails when I try to 
run unit tests against it, throwing a SparkException:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 
1, localhost): java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreach$1.f$14 of type 
org.apache.spark.api.java.function.VoidFunction in instance of 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreach$1
at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2089)

(Full error/stacktrace and description on SO: 
http://stackoverflow.com/q/30871109/2687324).
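
For context, a minimal sketch of the kind of local-mode harness typically used 
for such tests (illustrative only; this does not reproduce or fix the 
ClassCastException above):

import org.apache.spark.{SparkConf, SparkContext}

// Local-mode context for unit tests, using two worker threads.
val conf = new SparkConf().setMaster("local[2]").setAppName("unit-test")
val sc = new SparkContext(conf)
try {
  val doubled = sc.parallelize(Seq(1, 2, 3)).map(_ * 2).collect()
  assert(doubled.sameElements(Array(2, 4, 6)))
} finally {
  sc.stop()  // stop the context so later tests can create their own
}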

Has anyone experienced this error before while unit testing?

Thanks,
Mark


ReduceByKey with a byte array as the key

2015-06-11 Thread Mark Tse
I would like to work with RDD pairs of Tuple2<byte[], obj>, but byte[]s with 
the same contents are treated as different keys because they are compared by 
reference rather than by content.

I didn't see any way to pass in a custom comparer. I could convert the byte[] 
into a String with an explicit charset, but I'm wondering if there's a more 
efficient way.
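
For illustration, the String workaround might look like this (pairs is a 
hypothetical RDD of (Array[Byte], Int) pairs; ISO-8859-1 maps every byte to 
exactly one char, so the conversion is lossless, unlike UTF-8):

import java.nio.charset.StandardCharsets

// Hypothetical workaround: encode each byte[] key as a String so that
// keys with equal contents compare equal under reduceByKey.
val byStringKey = pairs
  .map { case (k, v) => (new String(k, StandardCharsets.ISO_8859_1), v) }
  .reduceByKey(_ + _)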

Also posted on SO: http://stackoverflow.com/q/30785615/2687324

Thanks,
Mark


RE: ReduceByKey with a byte array as the key

2015-06-11 Thread Mark Tse
Makes sense – I suspect what you suggested should work.

However, I think the overhead between this and using `String` would be similar 
enough to warrant just using `String`.

Mark

From: Sonal Goyal [mailto:sonalgoy...@gmail.com]
Sent: June-11-15 12:58 PM
To: Mark Tse
Cc: user@spark.apache.org
Subject: Re: ReduceByKey with a byte array as the key

I think if you wrap the byte[] into an object and implement equals and hashcode 
methods, you may be able to do this. There will be the overhead of extra 
object, but conceptually it should work unless I am missing something.
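
A minimal Scala sketch of that idea (the BytesKey name is hypothetical):

// Wrapper giving byte[] keys content-based equality and hashing.
class BytesKey(val bytes: Array[Byte]) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case that: BytesKey => java.util.Arrays.equals(bytes, that.bytes)
    case _ => false
  }
  override def hashCode: Int = java.util.Arrays.hashCode(bytes)
}

// Usage: wrap the key before the shuffle, e.g.
// pairs.map { case (k, v) => (new BytesKey(k), v) }.reduceByKey(_ + _)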

Best Regards,
Sonal
Founder, Nube Technologies
http://www.nubetech.co
Check out Reifier at Spark Summit 2015:
https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/



