Re: How to filter a sorted RDD

2013-11-03 Thread Xiang Huo
Hi Mark, Could you tell me more detailed information about how to short-circuit the filtering? Thanks. Xiang 2013/11/4 Mark Hamstra > You could short-circuit the filtering within the iterator function > supplied to mapPartitions. > > > On Sunday, November 3, 2013, Xiang Huo wrote: > >> Hi al…

Re: How to filter a sorted RDD

2013-11-03 Thread Mark Hamstra
You could short-circuit the filtering within the iterator function supplied to mapPartitions. On Sunday, November 3, 2013, Xiang Huo wrote: > Hi all, > > I am trying to filter a smaller RDD data set from a large RDD data set. > And the large one is sorted. So my question is that is there any wa…
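Mark's suggestion can be sketched in plain Java, no Spark required: because each partition of a sorted RDD is itself sorted, the function handed to mapPartitions can stop consuming its iterator at the first element past the cutoff, instead of scanning the whole partition the way filter() would. The method name and cutoff below are illustrative, not from the thread.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ShortCircuitFilter {
    // Collect elements while they stay below the cutoff, then stop.
    // Correct only because the input iterator is sorted: once one element
    // fails the predicate, no later element can pass it.
    public static List<Integer> shortCircuitFilter(Iterator<Integer> sorted, int cutoff) {
        List<Integer> kept = new ArrayList<>();
        while (sorted.hasNext()) {
            int x = sorted.next();
            if (x >= cutoff) break;   // short-circuit: abandon the rest of the partition
            kept.add(x);
        }
        return kept;
    }
}
```

Inside Spark this logic would run once per partition, roughly `rdd.mapPartitions(it -> shortCircuitFilter(it, cutoff).iterator())` with a modern Java API; the win over filter() is that a partition is abandoned as soon as its first out-of-range element appears.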

Re: How to filter a sorted RDD

2013-11-03 Thread Zongheng Yang
Element-wise: that sounds like a sequential control flow whereas RDDs are inherently parallel collections. I'm also interested to know if it's possible. Partition-wise: PartitionPruningRDD [1] may be of help. [1] http://spark.incubator.apache.org/docs/0.8.0/api/core/org/apache/spark/rdd/Partiti…

How to filter a sorted RDD

2013-11-03 Thread Xiang Huo
Hi all, I am trying to filter a smaller RDD data set out of a large RDD data set, and the large one is sorted. So my question is: is there any way to keep the filter method from checking every element in the RDD, and instead filter out all the remaining elements once one element doesn't meet the condition of the filt…

Re: java.io.NotSerializableException on RDD count() in Java

2013-11-03 Thread Reynold Xin
Yeah, so every inner class actually contains a field referencing the outer class. In your case, the anonymous class DoubleFlatMapFunction actually has a this$0 field referencing the outer class AnalyticsEngine, which is why Spark is trying to serialize AnalyticsEngine. In the Scala API, the closure…
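Reynold's point can be reproduced with plain java.io serialization, no Spark required: a non-static inner class carries a hidden this$0 reference to its enclosing instance, so serializing it drags the (non-serializable) outer object along, while a static nested class has no such reference. The class names below are invented for the demo.

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class OuterRef {
    // Static nested class: no hidden reference to OuterRef, serializes fine.
    static class StandaloneFn implements Serializable { }

    // Non-static inner class: the compiler adds a hidden this$0 field of
    // type OuterRef, and OuterRef is not Serializable.
    class InnerFn implements Serializable { }

    public static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (java.io.NotSerializableException e) {
            return false;   // something reachable from o was not serializable
        } catch (java.io.IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The fix in the thread's terms: make the function a static nested (or top-level) class so it no longer captures the enclosing AnalyticsEngine.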

Re: cluster hangs for no apparent reason

2013-11-03 Thread Shangyu Luo
I met the problem of 'Too many open files' before. One solution is adding 'ulimit -n 10' in the spark-env.sh file. Basically, I think the local variable may not be a problem, as I have written programs with local variables as parameters for functions and the programs work. 2013/11/3 Walrus the…
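For reference, the workaround Shangyu describes is a one-line addition to conf/spark-env.sh. The exact limit is site-specific; the value below is only illustrative, since the number in the archived message appears truncated:

```shell
# conf/spark-env.sh
# Raise the per-process open-file limit before the Spark daemons start,
# to avoid "Too many open files" during shuffles with many partitions.
ulimit -n 10000   # illustrative value; tune for your workload
```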

Re: java.io.NotSerializableException on RDD count() in Java

2013-11-03 Thread Patrick Wendell
Hm, I think you are triggering a bug in the Java API where closures may not be properly cleaned. I think @rxin has reproduced this, deferring to him. - Patrick On Sun, Nov 3, 2013 at 5:25 PM, Yadid Ayzenberg wrote: > code is below. in the code rdd.count() works, but rdd2.count() fails. > > publi…

RE: Executor could not connect to Driver?

2013-11-03 Thread Liu, Raymond
Thanks, my case seems not caused by GC; CPU is pretty low and both YGC and FGC seem to behave quite normally. Hmm, weird. Best Regards, Raymond Liu From: Aaron Davidson [mailto:ilike...@gmail.com] Sent: Saturday, November 02, 2013 12:07 AM To: user@spark.incubator.apache.org Subject: Re: Executor…

Re: java.io.NotSerializableException on RDD count() in Java

2013-11-03 Thread Yadid Ayzenberg
Code is below. In the code, rdd.count() works, but rdd2.count() fails. public class AnalyticsEngine implements Serializable { private static AnalyticsEngine engine=null; private JavaSparkContext sc; final Logger logger = LoggerFactory.getLogger(AnalyticsEngine.class); private Pr…

Re: java.io.NotSerializableException on RDD count() in Java

2013-11-03 Thread Patrick Wendell
Thanks, that would help. This would be consistent with there being a reference to the SparkContext itself inside of the closure. Just want to make sure that's not the case. On Sun, Nov 3, 2013 at 5:13 PM, Yadid Ayzenberg wrote: > Im running in local[4] mode - so there are no slave machines. Full s…

Re: java.io.NotSerializableException on RDD count() in Java

2013-11-03 Thread Yadid Ayzenberg
I'm running in local[4] mode - so there are no slave machines. Full stack trace: (run-main) org.apache.spark.SparkException: Job failed: java.io.NotSerializableException: edu.mit.bsense.AnalyticsEngine org.apache.spark.SparkException: Job failed: java.io.NotSerializableException: edu.mit.bsense…

Re: java.io.NotSerializableException on RDD count() in Java

2013-11-03 Thread Patrick Wendell
If you look in the UI, are there failures on any of the slaves that you can give a stack trace for? That would narrow down where the serialization error is happening. Unfortunately this code path doesn't print a full stack trace, which makes it harder to debug where the serialization error comes f…

Re: java.io.NotSerializableException on RDD count() in Java

2013-11-03 Thread Yadid Ayzenberg
Yes, I tried that as well (it is currently registered with Kryo), although it doesn't make sense to me (and doesn't solve the problem). I also made sure my registration was running: DEBUG org.apache.spark.serializer.KryoSerializer - Running user registrator: edu.mit.bsense.MyRegistrator 7841 [spa…

Re: java.io.NotSerializableException on RDD count() in Java

2013-11-03 Thread Patrick Wendell
edu.mit.bsense.AnalyticsEngine Look at the exception. Basically, you'll need to register every class type that is recursively used by BSONObject. On Sun, Nov 3, 2013 at 4:27 PM, Yadid Ayzenberg wrote: > Hi Patrick, > > I am in fact using Kryo and im registering BSONObject.class (which is class…

Re: java.io.NotSerializableException on RDD count() in Java

2013-11-03 Thread Yadid Ayzenberg
Hi Patrick, I am in fact using Kryo and I'm registering BSONObject.class (which is the class holding the data) in my KryoRegistrator. I'm not sure what other classes I should be registering. Thanks, Yadid On 11/3/13 7:23 PM, Patrick Wendell wrote: The problem is you are referencing a class that…

Re: java.io.NotSerializableException on RDD count() in Java

2013-11-03 Thread Patrick Wendell
The problem is you are referencing a class that does not "extend Serializable" in the data that you shuffle. Spark needs to send all shuffle data over the network, so it needs to know how to serialize them. One option is to use Kryo for network serialization as described here - you'll have to regi…
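Patrick's point can be demonstrated with plain Java serialization: the whole object graph reachable from a shuffled value must be serializable, so a single nested field whose class lacks `implements Serializable` breaks serialization of the outer record even though the record itself declares it. The class names below are invented for the demo, not from the thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ShuffleValueDemo {
    // NOT serializable - stands in for a field type like the ones nested
    // inside BSONObject that also need handling.
    static class Payload { }

    // Declares Serializable, but its field poisons the whole graph.
    static class Record implements Serializable {
        Payload payload = new Payload();
    }

    public static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (java.io.NotSerializableException e) {
            return false;   // some reachable object was not serializable
        } catch (java.io.IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

With Kryo the analogous requirement is registering (or at least making serializable-by-Kryo) every class type reachable from the shuffled values, which is what the follow-up messages discuss.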

Re: cluster hangs for no apparent reason

2013-11-03 Thread Walrus theCat
Hi Shangyu, I appreciate your ongoing correspondence. To clarify, my solution didn't work, and I didn't expect it to. I was digging through the logs, and I found a series of exceptions (in only one of the workers): 13/11/03 17:51:05 INFO client.DefaultHttpClient: Retrying connect 13/11/03 17:51:…

Re: cluster hangs for no apparent reason

2013-11-03 Thread Shangyu Luo
Hi Walrus, Thank you for sharing the solution to your problem. I think I have met a similar problem before (i.e., one machine is working while the others are idle), and I just waited for a long time until the program continued processing. I am not sure how your program filters an RDD by a locally s…

java.io.NotSerializableException on RDD count() in Java

2013-11-03 Thread Yadid Ayzenberg
Hi All, My original RDD contains arrays of doubles. When applying a count() operator to the original RDD I get the result as expected. However, when I run a map on the original RDD in order to generate a new RDD with only the first element of each array, and try to apply count() to the new gener…

Re: cluster hangs for no apparent reason

2013-11-03 Thread Walrus theCat
Hi Shangyu, Thanks for responding. This is a refactor of other code that isn't completely scalable because it pulls stuff to the driver. This code keeps everything on the cluster. I left it running for 7 hours, and the log just froze. I checked ganglia, and only one machine's CPU seemed to be…