Re: spark application running in yarn client mode is slower than in local mode.

2018-04-09 Thread Junfeng Chen
But I still have one question. I see that the number of tasks in the stage is 3. Where does this 3 come from, and how can I increase the parallelism? Regards, Junfeng Chen On Tue, Apr 10, 2018 at 11:31 AM, Junfeng Chen wrote: > Yeah, I have increased the executor number and executor cores, and it runs > normally now.

pyspark.daemon exhaust a lot of memory

2018-04-09 Thread Niu Zhaojie
Hi All, We are running Spark 2.1.1 on Hadoop YARN 2.6.5. We found that the pyspark.daemon process consumes more than 300GB of memory. However, according to https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals, the daemon process shouldn't have this problem. Also, we find the daemon proces

Re: spark application running in yarn client mode is slower than in local mode.

2018-04-09 Thread Junfeng Chen
Yeah, I have increased the executor number and executor cores, and it runs normally now. HDP Spark 2 has only 2 executors and 1 executor core by default. Regards, Junfeng Chen On Tue, Apr 10, 2018 at 10:19 AM, Saisai Shao wrote: > In yarn mode, only two executors are assigned to process the

Re: [Mesos] How to Disable Blacklisting on Mesos?

2018-04-09 Thread Susan X. Huynh
Hi Han, You may be seeing the same issue I described here: https://issues.apache.org/jira/browse/SPARK-22342?focusedCommentId=16411780&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16411780 Do you see "TASK_LOST" in your driver logs? I got past that issue by updat

Re: spark application running in yarn client mode is slower than in local mode.

2018-04-09 Thread Saisai Shao
> > In yarn mode, only two executors are assigned to process the task, since > one executor can process one task only, they need 6 min in total. > This is not true. You should set --executor-cores/--num-executors to increase the task parallelism of the executors. To be fair, Spark application should ha

Re: spark application running in yarn client mode is slower than in local mode.

2018-04-09 Thread Junfeng Chen
I found the potential reason. In local mode, all tasks in one stage run concurrently, while tasks in yarn mode run in sequence. For example, suppose each task in a stage takes 3 min. In local mode, they run together and take 3 min in total. In yarn mode, only two executors are assigned to p
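
The arithmetic behind this observation can be sketched as follows. This is only an illustration, not code from the thread: the helper name stage_minutes is hypothetical, and it assumes every task takes the same time and each task occupies one core (the Spark default for spark.task.cpus). The inputs are the figures reported elsewhere in this thread: 3 tasks in the stage, about 3 min per task, and 2 executors with 1 core each.

```python
import math


def stage_minutes(num_tasks, num_executors, executor_cores, minutes_per_task):
    """Estimate stage wall time (hypothetical helper, equal-length tasks assumed)."""
    # Number of tasks that can run at the same time, assuming one core per task.
    slots = num_executors * executor_cores
    # Tasks run in "waves": each wave fills every available slot once.
    waves = math.ceil(num_tasks / slots)
    return waves * minutes_per_task


# 2 executors x 1 core: 3 tasks need 2 waves.
print(stage_minutes(num_tasks=3, num_executors=2, executor_cores=1, minutes_per_task=3))

# 3 executors x 1 core: all 3 tasks fit in a single wave.
print(stage_minutes(num_tasks=3, num_executors=3, executor_cores=1, minutes_per_task=3))
```

Under these assumptions the first call gives 6 min and the second 3 min, which is consistent with the timings described in the thread and with the advice to raise --num-executors or --executor-cores.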

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-09 Thread Gourav Sengupta
Hi, what I am curious about is the reassignment of df. Can you please look into the explain plan of df after the statement df = df.join(df_t.select("ID"),["ID"])? And then compare it with the explain plan of df1 after the statement df1 = df.join(df_t.select("ID"),["ID"])? It's late here, but I am ye

A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-09 Thread Shiyuan
Hi Spark Users, The following code snippet raises an "attribute missing" error even though the attribute exists. This bug is triggered by a particular sequence of "select", "groupby" and "join". Note that if I take away the "select" in #line B, the code runs without error. However, the "select

Re: [Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-09 Thread Aakash Basu
Hey Felix, I've already tried with .format("memory") .queryName("tableName") but, still, it doesn't work for the second query. It just stalls the program expecting new data for the first query. Here's my code - from pyspark.sql import SparkSession from pyspark.sql.functions import split sp

Re: spark application running in yarn client mode is slower than in local mode.

2018-04-09 Thread Junfeng Chen
Hi Jorn, I checked the log info of my application: ResultStage3 (parquet writing) takes a very long time, nearly 300s, while the total processing time of this loop is 6 min. Regards, Junfeng Chen On Mon, Apr 9, 2018 at 2:12 PM, Jörn Franke wrote: > Probably network / shuffling cost? Or broa