oducer and could not reproduce it even though
> they spent their time on it. A memory leak issue is not really easy to
> reproduce, unless it leaks some objects unconditionally.
>
> - Jungtaek Lim (HeartSaVioR)
>
> On Sun, Oct 20, 2019 at 7:18 PM Paul Wais wrote:
Dear List,
I've observed some sort of memory leak when using pyspark to run ~100
jobs in local mode. Each job is essentially a create RDD -> create DF
-> write DF sort of flow. The RDD and DFs go out of scope after each
job completes, hence I call this issue a "memory leak." Here's
pseudocode:
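The pseudocode itself is cut off in the archive; the shape of the loop being described can be mimicked without Spark at all (class and variable names here are hypothetical stand-ins for the per-job RDD/DataFrame objects), which also shows why the leak is surprising: per-job state should be collectable once it goes out of scope.

```python
import gc
import weakref

class JobState:
    """Hypothetical stand-in for the per-job RDD/DataFrame objects."""
    def __init__(self, i):
        # Stand-in for: create RDD -> create DF -> write DF.
        self.rows = [{"id": i, "value": i * i}]

refs = []
for i in range(100):
    state = JobState(i)              # one "job" worth of state
    refs.append(weakref.ref(state))  # weakly observe the per-job object
    del state                        # goes out of scope after each job
gc.collect()
leaked = sum(1 for r in refs if r() is not None)
print(leaked)   # prints 0: nothing should survive the loop
```

If the real pyspark loop behaves like this sketch, memory should stay flat across jobs; growth despite the objects being unreachable is what makes it look like a leak inside Spark itself.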
Dear List,
Has anybody gotten avro support to work in pyspark? I see multiple
reports of it being broken on Stackoverflow and added my own repro to
this ticket:
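One commonly cited fix for avro in pyspark is that, as of Spark 2.4.x, avro support lives in an external package that must be pulled in at launch; a sketch (the version coordinate is illustrative and must match your Spark/Scala build):

```shell
# spark-avro is an external package in Spark 2.4.x; load it at launch.
# Match the coordinate to your Spark and Scala versions.
pyspark --packages org.apache.spark:spark-avro_2.11:2.4.4
# then: spark.read.format("avro").load("path/to/file.avro")
```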
Dear List,
I'm investigating some problems related to native code integration
with Spark, and while picking through BlockManager I noticed that data
(de)serialization currently issues lots of array copies.
Specifically:
- Deserialization: BlockManager marshals all deserialized bytes
through a
/15 9:51 PM, Paul Wais wrote:
Dear List,
What are common approaches for addressing over a union of tables / RDDs?
E.g. suppose I have a collection of log files in HDFS, one log file per day,
and I want to compute the sum of some field over a date range in SQL. Using
log schema, I can read
To force one instance per executor, you could explicitly subclass
FlatMapFunction and have it lazy-create your parser in the subclass
constructor. You might also want to try RDD#mapPartitions() (instead of
RDD#flatMap()) if you want one instance per partition. This approach worked
well for me
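The pattern being recommended, one lazily created parser per partition rather than per element, can be sketched without Spark as a mapPartitions-style generator (the `Parser` class and data here are hypothetical):

```python
class Parser:
    """Hypothetical stand-in for an expensive-to-construct parser."""
    instances = 0
    def __init__(self):
        Parser.instances += 1
    def parse(self, line):
        return line.upper()

def parse_partition(lines_iter):
    # mapPartitions-style: called once per partition, so the parser is
    # constructed once and reused for every element in that partition.
    parser = Parser()
    for line in lines_iter:
        yield parser.parse(line)

partitions = [["a", "b"], ["c"]]
out = [list(parse_partition(iter(p))) for p in partitions]
print(out, Parser.instances)   # [['A', 'B'], ['C']] 2
```

Two partitions yield two parser instances, rather than one per element as a naive flatMap would.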
Dear List,
What are common approaches for addressing over a union of tables / RDDs?
E.g. suppose I have a collection of log files in HDFS, one log file per
day, and I want to compute the sum of some field over a date range in SQL.
Using log schema, I can read each as a distinct SchemaRDD, but I
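The shape of the question, union the per-day tables, filter by a date range, sum a field, can be sketched without Spark (table contents and the `bytes` field are hypothetical):

```python
from datetime import date
from itertools import chain

# Hypothetical per-day "tables": one list of rows per daily log file.
tables = {
    date(2014, 7, 1): [{"bytes": 10}, {"bytes": 5}],
    date(2014, 7, 2): [{"bytes": 7}],
    date(2014, 7, 3): [{"bytes": 100}],
}

def sum_over_range(tables, start, end, field):
    # "Union" the per-day tables that fall in [start, end], then sum.
    in_range = (rows for d, rows in tables.items() if start <= d <= end)
    return sum(row[field] for row in chain.from_iterable(in_range))

total = sum_over_range(tables, date(2014, 7, 1), date(2014, 7, 2), "bytes")
print(total)   # 22
```

In Spark the union step would be a union of the per-day RDDs/tables and the filter/sum a SQL query, but the logical plan is the same.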
More thoughts. I took a deeper look at BlockManager, RDD, and friends.
Suppose one wanted to get native code access to un-deserialized blocks.
This task looks very hard. An RDD behaves much like a Scala iterator of
deserialized values, and interop with BlockManager is all on deserialized
data.
Dear List,
Has anybody had experience integrating C/C++ code into Spark jobs?
I have done some work on this topic using JNA. I wrote a FlatMapFunction
that processes all partition entries using a C++ library. This approach
works well, but there are some tradeoffs:
* Shipping the native
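A minimal pure-Python analogue of the approach, calling into a native C library from a per-partition function, can be written with ctypes standing in for JNA (this assumes a POSIX system; `labs` from the C runtime is used only because it is universally available):

```python
import ctypes

# Load the symbols of the current process (POSIX); the C runtime's
# labs() is a long -> long absolute-value function.
libc = ctypes.CDLL(None)
libc.labs.argtypes = [ctypes.c_long]
libc.labs.restype = ctypes.c_long

def native_abs_partition(values_iter):
    # FlatMapFunction-style: set up the native library once, then
    # stream every element of the partition through it.
    for v in values_iter:
        yield libc.labs(v)

result = list(native_abs_partition(iter([-3, 4, -5])))
print(result)   # [3, 4, 5]
```

The tradeoffs mentioned (shipping the native library to executors, marshalling costs per element) apply the same way whether the bridge is JNA on the JVM or ctypes in Python.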
taking memory.
On Oct 30, 2014 6:43 PM, Paul Wais pw...@yelp.com wrote:
Dear Spark List,
I have a Spark app that runs native code inside map functions. I've
noticed that the native code sometimes sets errno to ENOMEM indicating
a lack of available memory. However, I've
Looks like an OOM issue? Have you tried persisting your RDDs to allow
disk writes?
I've seen a lot of similar crashes in a Spark app that reads from HDFS
and does joins. I.e., I've seen "java.io.IOException: Filesystem
closed", "Executor lost", "FetchFailed", etc., with
non-deterministic crashes.
Well it looks like this is indeed a protobuf issue. Poked a little more
with Kryo. Since protobuf messages are serializable, I tried just making
Kryo use the JavaSerializer for my messages. The resulting stack trace
made it look like protobuf GeneratedMessageLite is actually using the
Derp, one caveat to my solution: I guess Spark doesn't use Kryo for
Function serde :(
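For reference, the workaround described above, falling back to Java serialization for protobuf classes while keeping Kryo for everything else, is normally wired up through configuration plus a registrator class; a sketch of the conf side (the class name com.example.ProtoKryoRegistrator is hypothetical):

```
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator  com.example.ProtoKryoRegistrator
```

Inside the registrator, each protobuf message class would be registered with Kryo's JavaSerializer, so Java serialization handles those classes while Kryo handles the rest. As the caveat above notes, this does not cover closure/function serde, which Spark does not route through Kryo.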
On Fri, Sep 19, 2014 at 12:44 AM, Paul Wais pw...@yelp.com wrote:
Well it looks like this is indeed a protobuf issue. Poked a little more
with Kryo. Since protobuf messages are serializable, I tried just
Dear List,
I'm writing an application where I have RDDs of protobuf messages.
When I run the app via bin/spark-submit with --master local
--driver-class-path path/to/my/uber.jar, Spark is able to
ser/deserialize the messages correctly.
However, if I run WITHOUT --driver-class-path
/hadoop-project/pom.xml
On Thu, Sep 18, 2014 at 1:06 AM, Paul Wais pw...@yelp.com wrote:
Dear List,
I'm writing an application where I have RDDs of protobuf messages.
When I run the app via bin/spark-submit with --master local
--driver-class-path path/to/my/uber.jar, Spark is able to
ser
* https://github.com/apache/spark/pull/181
*
http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3c7f6aa9e820f55d4a96946a87e086ef4a4bcdf...@eagh-erfpmbx41.erf.thomson.com%3E
* https://groups.google.com/forum/#!topic/spark-users/Q66UOeA2u-I
On Thu, Sep 18, 2014 at 4:51 PM, Paul Wais pw
Hmm, would using Kryo help me here?
On Thursday, September 18, 2014, Paul Wais pw...@yelp.com wrote:
Ah, can one NOT create an RDD of any arbitrary Serializable type? It
looks like I might be getting bitten by the same
"java.io.ObjectInputStream uses root class loader only" bugs mentioned
/2f9b2bd7844ee8393dc9c319f4fefedf95f5e460/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L74
If uber.jar is on the classpath, then the root classloader would have
the code, hence why --driver-class-path fixes the bug.
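In command-line form, the fix being described, putting the uber jar on the driver's root classpath, looks like this (paths hypothetical):

```shell
# Make the uber jar visible to the driver's root classloader
spark-submit --master local \
  --driver-class-path /path/to/my/uber.jar \
  /path/to/my/uber.jar
# equivalently, in spark-defaults.conf:
#   spark.driver.extraClassPath  /path/to/my/uber.jar
```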
On Thu, Sep 18, 2014 at 5:42 PM, Paul Wais pw...@yelp.com wrote:
--master yarn-client ...
will fail.
But on the personal build obtained from the command above, both will then
work.
-Christian
On Sep 15, 2014, at 6:28 PM, Paul Wais pw...@yelp.com wrote:
Dear List,
I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
reading SequenceFiles
, when it shouldn't be packaged.
Spark works out of the box with just about any modern combo of HDFS and YARN.
On Tue, Sep 16, 2014 at 2:28 AM, Paul Wais pw...@yelp.com wrote:
Dear List,
I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
reading SequenceFiles. In particular
Dear List,
I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
reading SequenceFiles. In particular, I'm seeing:
Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
Server IPC version 7 cannot communicate with client version 4
at
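The "Server IPC version 7 cannot communicate with client version 4" message is the standard signature of a Hadoop 1.x HDFS client talking to a Hadoop 2.x cluster. For Spark 1.1, the usual fix was to build (or download) a distribution compiled against the cluster's Hadoop version, roughly (profile and version number illustrative):

```shell
# Build Spark 1.1 with an HDFS client matching a Hadoop 2.x cluster
mvn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
```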
in the second half of next month (or shortly
thereafter).
On Wed, Jul 16, 2014 at 4:03 PM, Paul Wais pw...@yelp.com wrote:
Dear List,
The version of pyspark on master has a lot of nice new features, e.g.
SequenceFile reading, pickle i/o, etc:
https://github.com/apache/spark/blob/master
-Paul Wais