Hi,
My Kafka topic has three partitions. By "time cost" I mean that each streaming batch takes more time in yarn-client mode. For example, yarn mode takes about 300 seconds to process data that local mode processes in about 200 seconds, for a similar amount of data.
Regards,
Junfeng Chen
On Mon
I read JSON string values from Kafka, then transform them to a DataFrame:
Dataset<Row> df = spark.read().json(stringjavaRDD);
Then add some new data to each row:
> JavaRDD<Row> rowJavaRDD = df.javaRDD().map(...)
> StructType type = df.schema().add(...)
> Dataset<Row> newdf = spark.createDataFrame(rowJavaRDD, type);
Hi Junfeng,
Is your Kafka topic partitioned?
Are you referring to the duration or to the CPU time spent by the job as being 20%
to 50% higher than when running in local mode?
Thanks & Regards
Gopal
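For context on why the partition question matters: each Kafka partition is consumed as a separate task, so three partitions cap the read parallelism at three. A minimal sketch of how keyed records spread across partitions (a simplified illustration only; Kafka's actual default partitioner hashes the serialized key bytes with murmur2, not String.hashCode, and the key names here are made up):

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionSketch {
    // Map a record key to one of numPartitions partitions.
    // Simplified stand-in for Kafka's default partitioner.
    static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the result is always non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;  // the topic in this thread has three partitions
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i < 9; i++) {
            int p = partitionFor("key-" + i, numPartitions);
            counts.merge(p, 1, Integer::sum);
        }
        System.out.println(counts);  // record counts per partition
    }
}
```

Records with the same key always land in the same partition, so a skewed key distribution can leave some of the three consumers idle while one does most of the work.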
> On 09-Apr-2018, at 11:42 AM, Jörn Franke wrote:
>
Probably network / shuffling cost? Or broadcast variables? Can you provide more
details on what you do, and some timings?
> On 9. Apr 2018, at 07:07, Junfeng Chen wrote:
I have written a Spark Streaming application that reads Kafka data, converts
the JSON data to parquet, and saves it to HDFS.
What puzzles me is that the processing time of the app in yarn mode is 20%
to 50% longer than in local mode. My cluster has three nodes with three
node managers, and all three ho
Hi all,
Spark currently has blacklisting enabled on Mesos, no matter what:
[SPARK-19755][Mesos] Blacklist is always active for
MesosCoarseGrainedSchedulerBackend
Blacklisting also prevents new drivers from running on our nodes where
previous drivers had failed tasks.
We've tried restarting Spar
The value is already stored in Azure blob storage and the entities in T1
reference it. My problem is that in the computation I need to run, fetching
the referenced value incurs a very large I/O penalty.
The reason is that the fetch is done once per record in T1, which may contain
1 million rec
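Since the relationship is many-to-one, many T1 records share the same referenced value, so one way to cut the I/O is to fetch each distinct reference once and cache it, reducing the number of blob reads from the number of T1 rows to the number of distinct references. A minimal sketch of that idea (fetchFromStore and the sample references are hypothetical stand-ins for the real blob-store read):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupFetchSketch {
    static int ioCalls = 0;  // count simulated blob-store reads

    // Hypothetical stand-in for the expensive read from blob storage.
    static String fetchFromStore(String ref) {
        ioCalls++;
        return "value-for-" + ref;
    }

    public static void main(String[] args) {
        // Many T1 records point at few distinct references (many-to-one).
        List<String> t1Refs = Arrays.asList("a", "b", "a", "a", "b", "c");
        Map<String, String> cache = new HashMap<>();
        for (String ref : t1Refs) {
            // Fetch each distinct reference only once.
            cache.computeIfAbsent(ref, DedupFetchSketch::fetchFromStore);
        }
        System.out.println(ioCalls);  // 3 reads instead of 6
    }
}
```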
What do you mean the value is very large in T2? How large? What is it? You
could put the large data in separate files on HDFS and just maintain a file
name in the table.
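A minimal sketch of the pattern Jörn describes: write the large value to its own file and keep only the small path string in the table row (a local temp directory stands in for HDFS here, and the record layout is assumed, not from the thread):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class BlobPointerSketch {
    // Write the large value to its own file and return the path to store
    // in the table instead of the value itself.
    static String externalize(Path dir, String recordId, byte[] largeValue) throws Exception {
        Path file = dir.resolve(recordId + ".bin");
        Files.write(file, largeValue);
        return file.toString();
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("blobs");  // stand-in for an HDFS directory
        byte[] big = new byte[1024];  // stand-in for a very large z2 value
        String pointer = externalize(dir, "z1-0001", big);
        // The table row now carries only the short pointer string.
        System.out.println(pointer);
    }
}
```

The table then stays small enough to join cheaply, and the payload is read only when (and if) a given pointer is actually dereferenced.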
> On 8. Apr 2018, at 19:52, Vitaliy Pisarev
> wrote:
I have two tables in spark:
T1
|--x1
|--x2
T2
|--z1
|--z2
- T1 is much larger than T2
- The values in column z2 are *very large*
- There is a many-to-one relationship between T1 and T2 (via the x2 and z1
columns).
I perform the following query:
select T1.x1, T2.z2 from
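A plain-Java sketch of what that many-to-one join amounts to: build a hash map on T2's z1 key (T2 is the small side) and probe it once per T1 row. The join condition T1.x2 = T2.z1 is inferred from the description above, and all sample values are made up:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JoinSketch {
    // select T1.x1, T2.z2 ... joining on T1.x2 = T2.z1,
    // as a hash-map build on T2 plus one probe per T1 row.
    static List<String> join(String[][] t1, Map<String, String> t2) {
        List<String> result = new ArrayList<>();
        for (String[] row : t1) {
            result.add(row[0] + " -> " + t2.get(row[1]));  // row = {x1, x2}
        }
        return result;
    }

    public static void main(String[] args) {
        // T2: z1 -> z2, where z2 stands in for the very large value.
        Map<String, String> t2 = new HashMap<>();
        t2.put("k1", "large-z2-A");
        t2.put("k2", "large-z2-B");

        // T1 rows as (x1, x2) pairs; x2 references T2.z1.
        String[][] t1 = { {"r1", "k1"}, {"r2", "k1"}, {"r3", "k2"} };

        System.out.println(join(t1, t2));
    }
}
```

Because T1 is much larger, each large z2 value is pulled into the result once per matching T1 row, which is exactly where the repeated I/O cost comes from when z2 lives in remote storage.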