Re: spark application running in yarn client mode is slower than in local mode.

2018-04-08 Thread Junfeng Chen
Hi, my Kafka topic has three partitions. By time cost I mean that each streaming loop takes more time in yarn client mode. For example, yarn mode takes 300 seconds to process some data, while local mode takes just 200 seconds to process a similar amount of data. Regards, Junfeng Chen On Mon

Re: spark application running in yarn client mode is slower than in local mode.

2018-04-08 Thread Junfeng Chen
I read JSON string values from Kafka, then transform them to a DataFrame: Dataset<Row> df = spark.read().json(stringjavaRDD); Then I add some new data to each row: > JavaRDD<Row> rowJavaRDD = df.javaRDD().map(...) > StructType type = df.schema().add(...) > Dataset<Row> newdf = spark.createDataFrame(rowJavaRDD, type);
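For reference, a minimal self-contained sketch of that pattern; the added column name and value are assumptions for illustration, not details from the original mail:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class ExtendRows {
        public static Dataset<Row> extend(SparkSession spark, JavaRDD<String> stringjavaRDD) {
            // Parse the JSON strings into a DataFrame (this infers the schema).
            Dataset<Row> df = spark.read().json(stringjavaRDD);

            // Extend the inferred schema with one extra (hypothetical) column.
            StructType type = df.schema().add("ingest_ts", DataTypes.LongType);

            // Copy each row's values and append the new column's value.
            JavaRDD<Row> rowJavaRDD = df.javaRDD().map(row -> {
                List<Object> values = new ArrayList<>();
                for (int i = 0; i < row.length(); i++) {
                    values.add(row.get(i));
                }
                values.add(System.currentTimeMillis());
                return RowFactory.create(values.toArray());
            });

            return spark.createDataFrame(rowJavaRDD, type);
        }
    }

One thing worth checking for the timing question: spark.read().json(...) on an RDD runs a separate pass over the data just to infer the schema, once per micro-batch; supplying an explicit schema via spark.read().schema(...) removes that pass.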

Re: spark application running in yarn client mode is slower than in local mode.

2018-04-08 Thread Gopala Krishna Manchukonda
Hi Junfeng, Is your Kafka topic partitioned? Are you referring to the duration or to the CPU time spent by the job being 20% - 50% higher than when running in local mode? Thanks & Regards Gopal > On 09-Apr-2018, at 11:42 AM, Jörn Franke wrote: > > Probably network / shuffling cost? Or broadcast v

Re: spark application running in yarn client mode is slower than in local mode.

2018-04-08 Thread Jörn Franke
Probably network / shuffling cost? Or broadcast variables? Can you provide more details on what you do and some timings? > On 9. Apr 2018, at 07:07, Junfeng Chen wrote: > > I have written a Spark Streaming application that reads Kafka data and converts > the JSON data to parquet and saves it to HDFS. > W
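On the broadcast-variable guess: a minimal sketch of the pattern, with a hypothetical lookup map. Shipping a large closure or broadcast to remote executors is exactly the kind of cost that shows up on yarn but not in local mode:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;

    public class BroadcastSketch {
        public static JavaRDD<String> enrich(JavaSparkContext jsc, JavaRDD<String> lines) {
            // Hypothetical small lookup table, shipped to each executor once
            // rather than being serialized into every task's closure.
            Map<String, String> lookup = new HashMap<>();
            lookup.put("some-key", "some-value");
            Broadcast<Map<String, String>> bc = jsc.broadcast(lookup);

            return lines.map(line -> line + "," + bc.value().getOrDefault(line, "unknown"));
        }
    }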

spark application running in yarn client mode is slower than in local mode.

2018-04-08 Thread Junfeng Chen
I have written a Spark Streaming application that reads Kafka data, converts the JSON data to parquet, and saves it to HDFS. What puzzles me is that the processing time of the app in yarn mode is 20% to 50% longer than in local mode. My cluster has three nodes with three node managers, and all three ho
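For concreteness, a minimal sketch of such a pipeline, assuming the Kafka 0.10 direct stream on Spark 2.x; the brokers, topic, batch interval, and output path are placeholders, not details from the original mail:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public class JsonToParquet {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder().appName("JsonToParquet").getOrCreate();
            JavaStreamingContext jssc = new JavaStreamingContext(
                    new JavaSparkContext(spark.sparkContext()), Durations.seconds(60));

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "broker:9092");   // placeholder
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "json-to-parquet");

            KafkaUtils.<String, String>createDirectStream(jssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(
                            Arrays.asList("mytopic"), kafkaParams)) // placeholder topic
                .map(ConsumerRecord::value)
                .foreachRDD(rdd -> {
                    if (!rdd.isEmpty()) {
                        // Parse the JSON strings and append the batch as parquet on HDFS.
                        spark.read().json(rdd)
                             .write().mode("append").parquet("hdfs:///data/out"); // placeholder
                    }
                });

            jssc.start();
            jssc.awaitTermination();
        }
    }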

[Mesos] How to Disable Blacklisting on Mesos?

2018-04-08 Thread hantuzun
Hi all, Spark currently has blacklisting enabled on Mesos, no matter what: [SPARK-19755][Mesos] Blacklist is always active for MesosCoarseGrainedSchedulerBackend. Blacklisting also prevents new drivers from running on nodes where previous drivers had failed tasks. We've tried restarting Spar
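For reference, this is the switch that would normally be expected to control blacklisting; per SPARK-19755 the Mesos coarse-grained backend keeps its own internal node blacklist regardless of this setting, which appears to be exactly the problem described above:

    # Normally disables task blacklisting, but the Mesos coarse-grained
    # backend's built-in failure tracking ignores it (SPARK-19755).
    spark-submit --conf spark.blacklist.enabled=false ...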

Re: Does joining table in Spark multiplies selected columns of smaller table?

2018-04-08 Thread Vitaliy Pisarev
The value is already stored in Azure blob store and the entities in T1 reference it. My problem is that in the computation I need to run, fetching the referenced value incurs a very large I/O penalty, because the fetch happens once per record in T1, which may contain 1 million rec

Re: Does joining table in Spark multiplies selected columns of smaller table?

2018-04-08 Thread Jörn Franke
What do you mean by the value in T2 being very large? How large? What is it? You could put the large data in separate files on HDFS and just maintain a file name in the table. > On 8. Apr 2018, at 19:52, Vitaliy Pisarev > wrote: > > I have two tables in spark: > > T1 > |--x1 > |--x2 > > T2 > |--
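A sketch of that suggestion, with hypothetical column names: T2 carries only a path column, the join moves cheap rows, and the heavy payload is read only for the distinct files the joined result actually references:

    import java.util.List;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.input_file_name;

    public class LazyPayload {
        public static Dataset<Row> joinThenFetch(SparkSession spark,
                                                 Dataset<Row> t1, Dataset<Row> t2) {
            // t2 holds a small string column "z2_path" instead of the huge z2 value,
            // so the join itself only shuffles small rows.
            Dataset<Row> joined = t1.join(t2, t1.col("x2").equalTo(t2.col("z1")));

            // Read the payload once per distinct referenced file, not once per T1 row.
            List<String> paths = joined.select("z2_path").distinct()
                    .as(Encoders.STRING()).collectAsList();
            Dataset<Row> payloads = spark.read()
                    .option("wholetext", true)               // one row per file
                    .text(paths.toArray(new String[0]))
                    .withColumnRenamed("value", "z2")
                    .withColumn("z2_path", input_file_name());

            return joined.join(payloads, "z2_path");
        }
    }

(In practice input_file_name() returns fully qualified URIs, so the stored z2_path values need to match that form for the final join to line up.)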

Does joining table in Spark multiplies selected columns of smaller table?

2018-04-08 Thread Vitaliy Pisarev
I have two tables in spark:

T1
|--x1
|--x2

T2
|--z1
|--z2

- T1 is much larger than T2
- The values in column z2 are *very large*
- There is a many-to-one relationship between T1 and T2 (via the x2 and z1 columns respectively).

I perform the following query: select T1.x1, T2.z2 from
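The query is cut off above; given the described relationship, the join condition is presumably T1.x2 = T2.z1. A minimal sketch under that assumption:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class JoinSketch {
        public static Dataset<Row> query(SparkSession spark, Dataset<Row> t1, Dataset<Row> t2) {
            t1.createOrReplaceTempView("T1");
            t2.createOrReplaceTempView("T2");
            // Join condition inferred from the many-to-one description above.
            return spark.sql("SELECT T1.x1, T2.z2 FROM T1 JOIN T2 ON T1.x2 = T2.z1");
        }
    }

Logically, each z2 value then appears once per matching T1 row. Whether it is physically copied that many times depends on the plan: if T2 is small enough to be broadcast, each executor receives one copy of T2, and the duplication generally only materializes when the result is shuffled or written out.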