Re: [spark on yarn] spark on yarn without DFS

2019-05-23 Thread Achilleus 003
This is interesting. Would really appreciate it if you could share what
exactly you changed in core-site.xml and yarn-site.xml.

On Wed, May 22, 2019 at 9:14 AM Gourav Sengupta 
wrote:

> just wondering what is the advantage of doing this?
>
> Regards
> Gourav Sengupta
>
> On Wed, May 22, 2019 at 3:01 AM Huizhe Wang 
> wrote:
>
>> Hi Hari,
>> Thanks :) I tried to do it as you said. It works ;)
>>
>>
>> On Mon, May 20, 2019 at 3:54 PM Hariharan wrote:
>>
>>> Hi Huizhe,
>>>
>>> You can set the "fs.defaultFS" field in core-site.xml to some path on
>>> s3. That way your spark job will use S3 for all operations that need HDFS.
>>> Intermediate data will still be stored on local disk though.
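>>>
>>> For example (just a sketch; "your-bucket" is a placeholder), the entry
>>> in core-site.xml would look roughly like:
>>>
>>>   <property>
>>>     <name>fs.defaultFS</name>
>>>     <value>s3a://your-bucket</value>
>>>   </property>
>>>
>>> (For s3a to work you would also need the hadoop-aws module and S3
>>> credentials configured.)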
>>>
>>> Thanks,
>>> Hari
>>>
>>> On Mon, May 20, 2019 at 10:14 AM Abdeali Kothari <
>>> abdealikoth...@gmail.com> wrote:
>>>
 While Spark can read from S3 directly in EMR, I believe it still needs
 HDFS to perform shuffles and to write intermediate data to disk when
 running jobs (i.e. when in-memory data needs to spill over to disk).

 For these operations, Spark does need a distributed file system - you
 could use something like EMRFS (which is like an HDFS backed by S3) on
 Amazon.
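
 For reference, reading from S3 directly is a one-liner in PySpark (a
 sketch; it assumes the s3a connector and credentials are already set up,
 and the bucket/path are placeholders):

   df = spark.read.text("s3a://your-bucket/path/to/input")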

 The issue could be something else too - so a stacktrace or error
 message could help in understanding the problem.



 On Mon, May 20, 2019, 07:20 Huizhe Wang 
 wrote:

> Hi,
>
> I want to use Spark on YARN without HDFS. I store my resources in AWS S3
> and use s3a to access them. However, after I used stop-dfs.sh to stop the
> NameNode and DataNode, I got an error in yarn cluster mode. Can I use
> YARN without starting DFS, and how would I use this mode?
>
> Yours,
> Jane
>



Re: Executors idle, driver heap exploding and maxing only 1 cpu core

2019-05-23 Thread Nicholas Hakobian
One potential cause of this is the optimizer being a little overzealous in
determining whether a table can be broadcast. Have you checked the UI or
query plan to see if any steps include a BroadcastHashJoin? It's possible
that the optimizer thinks it can fit the table in memory based on its size
on disk, but it actually cannot. In this case you might want to look at
tuning spark.sql.autoBroadcastJoinThreshold.
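
For example (a sketch; df and the threshold value are illustrative), you can
check the physical plan and then lower or disable automatic broadcasting:

  # Look for BroadcastHashJoin in the physical plan
  df.explain()

  # Set the cutoff in bytes, or -1 to disable automatic broadcast joins
  spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

The same setting can also be passed at submit time with
--conf spark.sql.autoBroadcastJoinThreshold=-1.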

Another potential cause: at the step where the driver looks like it is
"hanging", it may be attempting to load a data source that is backed by a
very large number of files. Spark maintains a cache of file paths for a
data source to determine task splits, and we've seen the driver appear to
hang and/or crash if you try to load thousands (or more) of tiny files per
partition and you have a large number of partitions.
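
If that turns out to be the case, a common mitigation (a rough sketch; the
paths, format, and target file count are all illustrative) is a one-off
compaction job that rewrites the tiny files as fewer, larger ones before the
main job reads them:

  src = "wasbs://<container>@<account>.blob.core.windows.net/raw/"
  dst = "wasbs://<container>@<account>.blob.core.windows.net/compacted/"
  # Read the many tiny files and rewrite them as ~200 larger files
  spark.read.parquet(src).repartition(200).write.mode("overwrite").parquet(dst)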

Hope this helps.

Nicholas Szandor Hakobian, Ph.D.
Principal Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com


On Thu, May 23, 2019 at 7:36 AM Ashic Mahtab  wrote:

> Hi,
> We have quite a long-winded Spark application that we inherited, with many
> stages. When we run it on our Spark cluster, things start off well enough.
> Workers are busy, lots of progress made, etc. etc. However, 30 minutes into
> processing, we see CPU usage of the workers drop drastically. At this time,
> we also see that the driver is maxing out exactly one core (though we've
> given it more than one), and its ram usage is creeping up. At this time,
> there are no logs coming out on the driver. Everything seems to stop, and
> then it suddenly starts working, and the workers start working again. The
> driver ram doesn't go down, but flatlines. A few minutes later, the same
> thing happens again - the world seems to stop. However, the driver soon
> crashes with an out of memory exception.
>
> What could be causing this sort of behaviour on the driver? We don't have
> any collect() or similar functions in the code. We're reading in from Azure
> blobs, processing, and writing back to Azure blobs. Where should we start
> in trying to get to the bottom of this? We're running Spark 2.4.1 in a
> stand-alone cluster.
>
> Thanks,
> Ashic.
>


unsubscribe

2019-05-23 Thread Mun, Woyou - US



Executors idle, driver heap exploding and maxing only 1 cpu core

2019-05-23 Thread Ashic Mahtab
Hi,
We have quite a long-winded Spark application that we inherited, with many stages.
When we run it on our Spark cluster, things start off well enough. Workers are
busy, lots of progress made, etc. etc. However, 30 minutes into processing, we 
see CPU usage of the workers drop drastically. At this time, we also see that 
the driver is maxing out exactly one core (though we've given it more than 
one), and its ram usage is creeping up. At this time, there are no logs coming
out on the driver. Everything seems to stop, and then it suddenly starts 
working, and the workers start working again. The driver ram doesn't go down, 
but flatlines. A few minutes later, the same thing happens again - the world 
seems to stop. However, the driver soon crashes with an out of memory exception.

What could be causing this sort of behaviour on the driver? We don't have any 
collect() or similar functions in the code. We're reading in from Azure blobs, 
processing, and writing back to Azure blobs. Where should we start in trying to 
get to the bottom of this? We're running Spark 2.4.1 in a stand-alone cluster.

Thanks,
Ashic.


PySpark Streaming “PicklingError: Could not serialize object” when use transform operator and checkpoint enabled

2019-05-23 Thread Xilang Yan
In PySpark Streaming, if checkpointing is enabled and a stream.transform
operator is used to join with another RDD, “PicklingError: Could not serialize
object” is thrown. I have asked the same question on Stack Overflow:
https://stackoverflow.com/questions/56267591/pyspark-streaming-picklingerror-could-not-serialize-object-when-checkpoint-an

After some investigation, I found that the problem is that checkpointing
serializes the lambda and then serializes the RDD referenced inside it. So I
changed the code to something like below; the intent is to use a static,
transient-like variable to avoid serializing the RDD.

class DocInfoHolder:
    doc_info = None

line.transform(lambda rdd: rdd.join(DocInfoHolder.doc_info)).pprint(10)

But the problem still exists. Then I found that PySpark uses a special pickler
called cloudpickle.py; it looks like it serializes any referenced class,
function, or lambda, and there is no documentation on how to skip
serialization. Could anyone help with how to work around this issue?
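
For what it's worth, one direction I am considering (just a sketch, not yet
tested against checkpoint recovery; doc_info_rdd is illustrative) is to avoid
referencing an RDD in the closure at all, and instead collect the lookup data
into a plain Python dict on the driver, which pickles cleanly:

  # Collect the (small, static) lookup data once on the driver
  doc_info_map = dict(doc_info_rdd.collect())

  # The transform closure now captures only a plain dict, not an RDD
  line.transform(
      lambda rdd: rdd.map(lambda kv: (kv[0], (kv[1], doc_info_map.get(kv[0]))))
  ).pprint(10)

This only works if the lookup data is small enough to collect and does not
change while the job runs.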



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org