Re: [spark on yarn] spark on yarn without DFS
This is interesting. Would really appreciate it if you could share what exactly you changed in core-site.xml and yarn-site.xml.

On Wed, May 22, 2019 at 9:14 AM Gourav Sengupta wrote:
> Just wondering, what is the advantage of doing this?
>
> Regards,
> Gourav Sengupta
>
> On Wed, May 22, 2019 at 3:01 AM Huizhe Wang wrote:
>> Hi Hari,
>> Thanks :) I tried to do it as you said. It works ;)
>>
>> Hariharan wrote on Mon, May 20, 2019 at 3:54 PM:
>>> Hi Huizhe,
>>>
>>> You can set the "fs.defaultFS" field in core-site.xml to some path on S3. That way your Spark job will use S3 for all operations that need HDFS. Intermediate data will still be stored on local disk though.
>>>
>>> Thanks,
>>> Hari
>>>
>>> On Mon, May 20, 2019 at 10:14 AM Abdeali Kothari <abdealikoth...@gmail.com> wrote:
>>>> While Spark can read from S3 directly in EMR, I believe it still needs HDFS to perform shuffles and to write intermediate data to disk during jobs (i.e. when the in-memory data needs to spill over to disk). For these operations, Spark does need a distributed file system - you could use something like EMRFS (which is like an HDFS backed by S3) on Amazon.
>>>>
>>>> The issue could be something else too - so a stack trace or error message could help in understanding the problem.
>>>>
>>>> On Mon, May 20, 2019, 07:20 Huizhe Wang wrote:
>>>>> Hi,
>>>>>
>>>>> I want to use Spark on YARN without HDFS. I store my resources in AWS and use s3a to get them. However, when I used stop-dfs.sh to stop the NameNode and DataNode, I got an error when running in yarn cluster mode. Can I use YARN without starting DFS, and how could I use this mode?
>>>>>
>>>>> Yours,
>>>>> Jane
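For reference, a minimal sketch of the core-site.xml change Hariharan describes (the bucket name is a placeholder; S3A credentials are normally supplied separately via fs.s3a.access.key/fs.s3a.secret.key or an instance profile, and the s3a connector needs the hadoop-aws jar on the classpath):

```xml
<configuration>
  <!-- Point the default filesystem at an S3 bucket instead of HDFS.
       "my-bucket" is a placeholder, not a real bucket. -->
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://my-bucket</value>
  </property>
</configuration>
```

As noted in the thread, shuffle and other intermediate data still land on the workers' local disks with this setup; only the default filesystem operations move to S3.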
Re: Executors idle, driver heap exploding and maxing only 1 cpu core
One potential cause is the optimizer being a little overzealous in deciding whether a table can be broadcast. Have you checked the UI or query plan to see whether any steps include a BroadcastHashJoin? It's possible that the optimizer thinks the table should fit in memory based on its size on disk, when in fact it cannot. In that case you may want to look at tuning spark.sql.autoBroadcastJoinThreshold.

Another possibility is that at the step where the driver appears to be "hanging", it is loading a data source backed by a very large number of files. Spark maintains a cache of file paths for a data source to determine task splits, and we've seen the driver appear to hang and/or crash when loading thousands (or more) of tiny files per partition across a large number of partitions.

Hope this helps.

Nicholas Szandor Hakobian, Ph.D.
Principal Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com

On Thu, May 23, 2019 at 7:36 AM Ashic Mahtab wrote:
> Hi,
> We have a quite long-winded Spark application we inherited, with many stages. When we run on our Spark cluster, things start off well enough. Workers are busy, lots of progress made, etc. However, 30 minutes into processing, we see CPU usage of the workers drop drastically. At this point, we also see that the driver is maxing out exactly one core (though we've given it more than one), and its RAM usage is creeping up. At this time, there are no logs coming out on the driver. Everything seems to stop, then it suddenly starts working again, and the workers resume. The driver RAM doesn't go down, but flatlines. A few minutes later, the same thing happens - the world seems to stop. However, the driver soon crashes with an out-of-memory exception.
>
> What could be causing this sort of behaviour on the driver? We don't have any collect() or similar calls in the code. We're reading in from Azure blobs, processing, and writing back to Azure blobs. Where should we start in trying to get to the bottom of this? We're running Spark 2.4.1 in a stand-alone cluster.
>
> Thanks,
> Ashic.
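If the UI does show an unexpected BroadcastHashJoin, the threshold can be raised, lowered, or disabled outright. A sketch of the relevant setting (the value shown is illustrative, not a recommendation for this particular job):

```
# spark-defaults.conf, or passed via --conf on spark-submit.
# The default is 10485760 (10 MB); -1 disables automatic broadcast joins
# entirely, forcing Spark to fall back to a sort-merge join.
spark.sql.autoBroadcastJoinThreshold  -1
```

Disabling it is a blunt but quick way to test the broadcast-join hypothesis: if the driver stops stalling with the setting at -1, the estimate-versus-actual-size mismatch described above is the likely culprit.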
Executors idle, driver heap exploding and maxing only 1 cpu core
Hi,

We have a quite long-winded Spark application we inherited, with many stages. When we run on our Spark cluster, things start off well enough. Workers are busy, lots of progress made, etc. However, 30 minutes into processing, we see CPU usage of the workers drop drastically. At this point, we also see that the driver is maxing out exactly one core (though we've given it more than one), and its RAM usage is creeping up. At this time, there are no logs coming out on the driver. Everything seems to stop, then it suddenly starts working again, and the workers resume. The driver RAM doesn't go down, but flatlines. A few minutes later, the same thing happens - the world seems to stop. However, the driver soon crashes with an out-of-memory exception.

What could be causing this sort of behaviour on the driver? We don't have any collect() or similar calls in the code. We're reading in from Azure blobs, processing, and writing back to Azure blobs. Where should we start in trying to get to the bottom of this? We're running Spark 2.4.1 in a stand-alone cluster.

Thanks,
Ashic.
PySpark Streaming “PicklingError: Could not serialize object” when use transform operator and checkpoint enabled
In PySpark Streaming, if checkpointing is enabled and a stream.transform operator is used to join with another RDD, a "PicklingError: Could not serialize object" is thrown. I have asked the same question on Stack Overflow: https://stackoverflow.com/questions/56267591/pyspark-streaming-picklingerror-could-not-serialize-object-when-checkpoint-an

After some investigation, I found that the problem is that checkpointing serializes the lambda, and in doing so serializes the RDD referenced inside the lambda. So I changed the code to something like the following, the idea being to use a static, transient-style variable so the RDD itself would not be serialized:

    class DocInfoHolder:
        doc_info = None

    line.transform(lambda rdd: rdd.join(DocInfoHolder.doc_info)).pprint(10)

But the problem still exists. I then found that PySpark uses a special pickler, cloudpickle.py, which looks like it will serialize any referenced class, function, or lambda code, and there is no documentation on how to skip serialization. Could anyone help with how to work around this issue?

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

To unsubscribe e-mail: user-unsubscr...@spark.apache.org
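A plain-Python illustration (no Spark required) of why hiding the RDD behind a class attribute cannot help. Standard pickle serializes functions by reference (module plus name), so anonymous lambdas cannot be pickled at all; that is exactly why PySpark ships closures with cloudpickle, which serializes the function body by value and, with it, everything reachable through its closure and referenced classes - including DocInfoHolder.doc_info:

```python
import pickle

def can_pickle(obj):
    """Return True if obj survives a round trip through standard pickle."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

# A lambda has no importable name, so reference-based pickling fails:
print(can_pickle(lambda rdd: rdd))   # False
# Ordinary data pickles fine:
print(can_pickle([1, 2, 3]))         # True
```

One commonly suggested workaround (a sketch of an approach, not tested against your job) is to collect the small side to the driver and distribute it as a plain Python dict via sc.broadcast, then perform the join with an ordinary map over each batch. The transform closure then captures only the broadcast handle, which is serializable, rather than an RDD, which is not.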