Re: spark application running in yarn client mode is slower than in local mode.
But I still have one question: the number of tasks in each stage is 3. Where does this 3 come from, and how can I increase the parallelism?

Regards,
Junfeng Chen

On Tue, Apr 10, 2018 at 11:31 AM, Junfeng Chen wrote:
> Yeah, I have increased the number of executors and the executor cores, and
> it runs normally now. HDP Spark 2 uses only 2 executors with 1 core each by
> default.
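On where the 3 comes from: a Spark stage runs one task per partition of its input. A likely explanation, assuming the direct Kafka integration (where each Kafka partition becomes one Spark partition), is that the three-partition topic mentioned elsewhere in this thread yields three tasks per stage. Extra executors or cores beyond three then sit idle until the data is split further, for example with `Dataset.repartition(int)`, at the cost of a shuffle. The sketch below only does the arithmetic; the class name, the 3 executors x 4 cores layout, and the factor of 2 tasks per core are illustrative assumptions, not values from the thread.

```java
// Rough arithmetic for "how many partitions do I need to keep the cluster busy?".
// All numbers in main() are hypothetical; only the 3 partitions come from the thread.
final class ParallelismSketch {
    // Tasks that can actually run at once: limited by partitions and by slots.
    static int busySlots(int partitions, int executors, int coresPerExecutor) {
        return Math.min(partitions, executors * coresPerExecutor);
    }

    // A common rule of thumb is a small multiple of the total core count.
    static int suggestedPartitions(int executors, int coresPerExecutor, int tasksPerCore) {
        return executors * coresPerExecutor * tasksPerCore;
    }

    public static void main(String[] args) {
        // 3 Kafka partitions on an assumed 3 executors x 4 cores: 9 of 12 slots idle.
        System.out.println("busy slots: " + busySlots(3, 3, 4));
        // Candidate argument for repartition(...) with 2 tasks per core.
        System.out.println("suggested partitions: " + suggestedPartitions(3, 4, 2));
    }
}
```

Whether repartitioning pays off depends on the batch size: the shuffle it introduces can cost more than the added parallelism saves on small batches.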
Re: spark application running in yarn client mode is slower than in local mode.
Yeah, I have increased the number of executors and the executor cores, and it runs normally now. HDP Spark 2 uses only 2 executors with 1 core each by default.

Regards,
Junfeng Chen

On Tue, Apr 10, 2018 at 10:19 AM, Saisai Shao wrote:
> This is not true. You should set --executor-cores/--num-executors to
> increase the task parallelism of the executors. To be fair, the Spark
> application should have the same resources (CPU/memory) when comparing
> local and yarn mode.
Re: spark application running in yarn client mode is slower than in local mode.
2018-04-10 10:05 GMT+08:00 Junfeng Chen:
> In yarn mode, only two executors are assigned to process the tasks; since
> one executor can process only one task, they need 6 min in total.

This is not true. You should set --executor-cores/--num-executors to increase the task parallelism of the executors. To be fair, the Spark application should have the same resources (CPU/memory) when comparing local and yarn mode.
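The arithmetic behind the 3 min versus 6 min figures can be sketched as follows. This is a simplified model, assuming every task takes the same time, each task needs one core (the default `spark.task.cpus` of 1), and tasks run in full waves over the available slots; the class and method names are made up for illustration.

```java
// Back-of-the-envelope model of stage duration under the assumptions above.
final class StageWaves {
    // A slot is one core of one executor; each slot runs one task at a time.
    static int slots(int executors, int coresPerExecutor) {
        return executors * coresPerExecutor;
    }

    // Tasks run in waves of at most `slots` tasks; each wave lasts one task time.
    static int wallClockMinutes(int tasks, int slots, int minutesPerTask) {
        int waves = (tasks + slots - 1) / slots; // integer ceil(tasks / slots)
        return waves * minutesPerTask;
    }

    public static void main(String[] args) {
        int tasks = 3, minutesPerTask = 3;
        // 2 executors x 1 core (the defaults reported in this thread): 2 waves.
        System.out.println(wallClockMinutes(tasks, slots(2, 1), minutesPerTask)); // 6
        // 2 executors x 2 cores: all 3 tasks fit in one wave.
        System.out.println(wallClockMinutes(tasks, slots(2, 2), minutesPerTask)); // 3
        // Local mode on a 40-core host, modelled as 1 executor with 40 cores.
        System.out.println(wallClockMinutes(tasks, slots(1, 40), minutesPerTask)); // 3
    }
}
```

In this model the slowdown comes from the slot count (executors times cores per executor), not from a one-task-per-executor limit, which is why raising either --num-executors or --executor-cores removes the second wave.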
Re: spark application running in yarn client mode is slower than in local mode.
I found the potential reason.

In local mode, all tasks in one stage run concurrently, while tasks in yarn mode run in sequence.

For example, suppose each task in a stage takes 3 min. In local mode they run together and take 3 min in total. In yarn mode, only two executors are assigned to process the tasks; since one executor can process only one task, they need 6 min in total.

Regards,
Junfeng Chen

On Mon, Apr 9, 2018 at 2:12 PM, Jörn Franke wrote:
> Probably network / shuffling cost? Or broadcast variables? Can you provide
> more details on what you do and some timings?
Re: spark application running in yarn client mode is slower than in local mode.
Hi Jörn,

I checked the log info of my application: ResultStage 3 (parquet writing) takes a very long time, nearly 300 s, while the total processing time of this loop is 6 min.

Regards,
Junfeng Chen

On Mon, Apr 9, 2018 at 2:12 PM, Jörn Franke wrote:
> Probably network / shuffling cost? Or broadcast variables? Can you provide
> more details on what you do and some timings?
Re: spark application running in yarn client mode is slower than in local mode.
Hi,

My Kafka topic has three partitions.

The time cost I mentioned means that each streaming loop takes longer in yarn client mode. For example, yarn mode takes 300 seconds to process some data, while local mode takes just 200 seconds to process a similar amount of data.

Regards,
Junfeng Chen

On Mon, Apr 9, 2018 at 2:20 PM, Gopala Krishna Manchukonda <gopala_krishna_manchuko...@apple.com> wrote:
> Hi Junfeng,
>
> Is your Kafka topic partitioned?
>
> Are you referring to the duration or the CPU time spent by the job as
> being 20% - 50% higher than when running in local mode?
>
> Thanks & Regards
> Gopal
Re: spark application running in yarn client mode is slower than in local mode.
I read JSON string values from Kafka, then transform them into a DataFrame:

    Dataset<Row> df = spark.read().json(stringjavaRDD);

Then I add some new data to each row:

    JavaRDD<Row> rowJavaRDD = df.javaRDD().map(...);
    StructType type = df.schema().add(...);   // added fields elided
    Dataset<Row> newdf = spark.createDataFrame(rowJavaRDD, type);
    ...

Finally, I write the dataset to Parquet files:

    newdf.write().mode(SaveMode.Append)
         .partitionBy("stream", "appname", "year", "month", "day", "hour")
         .parquet(savePath);

How can I determine whether the slowdown is caused by shuffle or broadcast?

Regards,
Junfeng Chen

On Mon, Apr 9, 2018 at 2:12 PM, Jörn Franke wrote:
> Probably network / shuffling cost? Or broadcast variables? Can you provide
> more details on what you do and some timings?
Re: spark application running in yarn client mode is slower than in local mode.
Hi Junfeng,

Is your Kafka topic partitioned?

Are you referring to the duration or the CPU time spent by the job as being 20% - 50% higher than when running in local mode?

Thanks & Regards
Gopal

On 09-Apr-2018, at 11:42 AM, Jörn Franke wrote:
> Probably network / shuffling cost? Or broadcast variables? Can you provide
> more details on what you do and some timings?
Re: spark application running in yarn client mode is slower than in local mode.
Probably network / shuffling cost? Or broadcast variables? Can you provide more details on what you do and some timings?

On 9. Apr 2018, at 07:07, Junfeng Chen wrote:
> I have written a Spark Streaming application that reads Kafka data,
> converts the JSON data to Parquet, and saves it to HDFS.
> What puzzles me is that the processing time of the app in yarn mode is
> 20% to 50% longer than in local mode.
>
> Why? How can I solve it?
spark application running in yarn client mode is slower than in local mode.
I have written a Spark Streaming application that reads Kafka data, converts the JSON data to Parquet, and saves it to HDFS.

What puzzles me is that the processing time of the app in yarn mode is 20% to 50% longer than in local mode. My cluster has three nodes with three node managers, and all three hosts have the same hardware: 40 cores and 256 GB of memory.

Why? How can I solve it?

Regards,
Junfeng Chen