Hi Mark,
Could you tell me in more detail how to short-circuit the
filtering?
Thanks.
Xiang
2013/11/4 Mark Hamstra
> You could short-circuit the filtering within the iterator function
> supplied to mapPartitions.
>
>
> On Sunday, November 3, 2013, Xiang Huo wrote:
>
>> Hi al
You could short-circuit the filtering within the iterator function
supplied to mapPartitions.
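For example, here is a minimal sketch in the Java API. The names are hypothetical: assume a sorted JavaRDD<Integer> called "sorted" and an upper bound "max". Because the data is sorted, each partition can stop scanning at the first element past the bound instead of testing every remaining element:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

final int max = 100;  // hypothetical filter bound
JavaRDD<Integer> filtered = sorted.mapPartitions(
    new FlatMapFunction<Iterator<Integer>, Integer>() {
      public Iterable<Integer> call(Iterator<Integer> it) {
        List<Integer> keep = new ArrayList<Integer>();
        while (it.hasNext()) {
          int x = it.next();
          if (x > max) break;  // input is sorted, so everything after this fails too
          keep.add(x);
        }
        return keep;
      }
    });

A plain filter() would still visit every element; the break above is the short-circuit.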
On Sunday, November 3, 2013, Xiang Huo wrote:
> Hi all,
>
> I am trying to filter a smaller RDD data set from a large RDD data set.
> And the large one is sorted. So my question is that is there any wa
Element-wise: that sounds like a sequential control flow whereas RDDs
are inherently parallel collections. I'm also interested to know if
it's possible.
Partition-wise: PartitionPruningRDD [1] may be of help.
[1]
http://spark.incubator.apache.org/docs/0.8.0/api/core/org/apache/spark/rdd/Partiti
Hi all,
I am trying to filter a smaller RDD data set from a large RDD data set. And
the large one is sorted. So my question is: is there any way to make
the filter method not check every element in the RDD, but skip all the
remaining elements once one element doesn't meet the filter condition?
Yeah, so every inner class actually contains a field referencing the outer
class. In your case, the anonymous class DoubleFlatMapFunction actually has
a this$0 field referencing the outer class AnalyticsEngine, which is why
Spark is trying to serialize AnalyticsEngine.
In the Scala API, the closure
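To make that concrete, here is a minimal hypothetical sketch (the class and method names are illustrative, not the code from this thread; in the 0.8-era Java API the function types are abstract classes, hence "extends"). A static nested class has no hidden this$0 field, so Spark serializes only the function itself:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

public class AnalyticsExample {  // suppose this outer class is not serializable

  // An anonymous Function created inside an instance method would carry a
  // hidden this$0 reference to AnalyticsExample and drag it into the closure.
  // A static nested class avoids that:
  static class FirstElement extends Function<double[], Double> {
    public Double call(double[] arr) {
      return arr[0];  // keep only the first element of each array
    }
  }

  public JavaRDD<Double> firstElements(JavaRDD<double[]> rdd) {
    return rdd.map(new FirstElement());  // no reference to the outer instance
  }
}

Copying any needed fields into local variables before defining the function achieves the same thing.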
I met the problem of 'Too many open files' before. One solution is raising
the limit by adding a larger 'ulimit -n' value (e.g. 'ulimit -n 10000') in
the spark-env.sh file.
Basically, I think the local variable may not be the problem, as I have
written programs with local variables as parameters for functions and the
programs work.
2013/11/3 Walrus theCat
Hm, I think you are triggering a bug in the Java API where closures
may not be properly cleaned. I think @rxin has reproduced this,
deferring to him.
- Patrick
On Sun, Nov 3, 2013 at 5:25 PM, Yadid Ayzenberg wrote:
> code is below. in the code rdd.count() works, but rdd2.count() fails.
>
> publi
Thanks. My case seems not to be caused by GC; CPU is pretty low, and both
YGC and FGC seem to behave quite normally. Hmm, weird.
Best Regards,
Raymond Liu
From: Aaron Davidson [mailto:ilike...@gmail.com]
Sent: Saturday, November 02, 2013 12:07 AM
To: user@spark.incubator.apache.org
Subject: Re: Executor
Code is below. In the code, rdd.count() works, but rdd2.count() fails.

public class AnalyticsEngine implements Serializable {

  private static AnalyticsEngine engine = null;
  private JavaSparkContext sc;
  final Logger logger = LoggerFactory.getLogger(AnalyticsEngine.class);
  private Pr
Thanks, that would help. This would be consistent with there being a
reference to the SparkContext itself inside of the closure. Just want
to make sure that's not the case.
On Sun, Nov 3, 2013 at 5:13 PM, Yadid Ayzenberg wrote:
> Im running in local[4] mode - so there are no slave machines. Full s
I'm running in local[4] mode, so there are no slave machines. Full stack
trace:
(run-main) org.apache.spark.SparkException: Job failed:
java.io.NotSerializableException: edu.mit.bsense.AnalyticsEngine
org.apache.spark.SparkException: Job failed:
java.io.NotSerializableException: edu.mit.bsense
If you look in the UI, are there failures on any of the slaves that
you can give a stack trace for? That would narrow down where the
serialization error is happening.
Unfortunately this code path doesn't print a full stack trace, which
makes it harder to debug where the serialization error comes from.
Yes, I tried that as well (it is currently registered with Kryo),
although it doesn't make sense to me (and doesn't solve the problem). I
also made sure my registration was running:
DEBUG org.apache.spark.serializer.KryoSerializer - Running user
registrator: edu.mit.bsense.MyRegistrator
7841 [spa
edu.mit.bsense.AnalyticsEngine
Look at the exception. Basically, you'll need to register every class
type that is recursively used by BSONObject.
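As a hedged sketch of what that can end up looking like (the extra classes here are an assumption; BasicBSONObject and BasicBSONList are the usual concrete types inside a BSON document, but let the exception messages tell you exactly what is missing):

import com.esotericsoftware.kryo.Kryo;
import org.apache.spark.serializer.KryoRegistrator;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;
import org.bson.types.BasicBSONList;

public class MyRegistrator implements KryoRegistrator {
  public void registerClasses(Kryo kryo) {
    kryo.register(BSONObject.class);
    kryo.register(BasicBSONObject.class);  // concrete document type
    kryo.register(BasicBSONList.class);    // nested arrays within documents
  }
}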
On Sun, Nov 3, 2013 at 4:27 PM, Yadid Ayzenberg wrote:
> Hi Patrick,
>
> I am in fact using Kryo and im registering BSONObject.class (which is class
Hi Patrick,
I am in fact using Kryo, and I'm registering BSONObject.class (which is
the class holding the data) in my KryoRegistrator.
I'm not sure what other classes I should be registering.
Thanks,
Yadid
On 11/3/13 7:23 PM, Patrick Wendell wrote:
The problem is you are referencing a class that
The problem is you are referencing a class that does not extend
Serializable in the data that you shuffle. Spark needs to send all
shuffle data over the network, so it needs to know how to serialize
them.
One option is to use Kryo for network serialization as described here
- you'll have to register the relevant classes.
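Concretely, in 0.8-era Spark this is wired up through Java system properties before the context is created (a sketch; the registrator class name matches the one mentioned elsewhere in this thread):

System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
System.setProperty("spark.kryo.registrator", "edu.mit.bsense.MyRegistrator");
// ...then construct the JavaSparkContext as usual.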
Hi Shangyu,
I appreciate your ongoing correspondence. To clarify, my solution didn't
work, and I didn't expect it to. I was digging through the logs, and I
found a series of exceptions (in only one of the workers):
13/11/03 17:51:05 INFO client.DefaultHttpClient: Retrying connect
13/11/03 17:51:
Hi Walrus,
Thank you for sharing the solution to your problem. I think I have met a
similar problem before (i.e., one machine is working while the others are
idle), and I just waited for a long time and the program continued
processing. I am not sure how your program filters an RDD by a locally
s
Hi All,
My original RDD contains arrays of doubles. When applying a count()
operator to the original RDD, I get the result as expected.
However, when I run a map on the original RDD in order to generate a new
RDD with only the first element of each array, and try to apply count()
to the new gener
Hi Shangyu,
Thanks for responding. This is a refactor of other code that isn't
completely scalable because it pulls stuff to the driver. This code keeps
everything on the cluster. I left it running for 7 hours, and the log just
froze. I checked ganglia, and only one machine's CPU seemed to be