Re: Standalone cluster node utilization

2016-07-14 Thread Jakub Stransky
I am seeing really weird behavior when loading data from an RDBMS. I tried a different approach for loading the data - I provided a partitioning column to get partitioned parallelism: val df_init = sqlContext.read.format("jdbc").options( Map("url" -> Configuration.dbUrl, "dbtabl…
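
For context, a minimal sketch of what such a partitioned JDBC read looks like with the Spark 1.x DataFrame API; the URL, table, column, and bounds below are hypothetical placeholders, not values from the original post:

    val df_init = sqlContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:postgresql://dbhost:5432/mydb", // hypothetical URL
      "dbtable"         -> "events",                             // hypothetical table
      "user"            -> "spark",
      "password"        -> "secret",
      "driver"          -> "org.postgresql.Driver",
      "partitionColumn" -> "id",      // numeric column used to split the read (Spark 1.x)
      "lowerBound"      -> "1",       // min value of the partition column
      "upperBound"      -> "1000000", // max value of the partition column
      "numPartitions"   -> "6"        // e.g. one read partition per executor
    )).load()

With all four partitioning options set, Spark issues numPartitions concurrent range queries instead of a single query on one executor.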

Re: Standalone cluster node utilization

2016-07-14 Thread Jakub Stransky
Hi Talebzadeh, sorry, I forgot to answer the last part of your question: "At O/S level you should see many CoarseGrainedExecutorBackend through jps, each corresponding to one executor. Are they doing anything?" There is one worker with one executor busy and the rest are almost idle: PID USER PR…

Re: Standalone cluster node utilization

2016-07-14 Thread Jakub Stransky
Hi Talebzadeh, we are using 6 worker machines, all running. We are reading the data through sqlContext (DataFrame), as suggested in the documentation, in preference to JdbcRDD; the props just specify name, password, and driver class. Right after this data load we register it as a temp table: val df_i…
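
A sketch of the load-then-register pattern being described, assuming the connection options from the sketch earlier in the thread; registerTempTable is the Spark 1.x call (createOrReplaceTempView in 2.x+):

    val df_init = sqlContext.read.format("jdbc")
      .options(jdbcOptions)               // hypothetical Map of url/dbtable/user/password/driver
      .load()
    df_init.registerTempTable("df_init")  // table is now visible to sqlContext.sql(...)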

Re: Standalone cluster node utilization

2016-07-14 Thread Mich Talebzadeh
Hi Jakub, Sounds like one executor. Can you point out: 1. The number of slaves/workers you are running 2. Are you using JDBC to read data in? 3. Do you register the DF as a temp table, and if so, have you cached the temp table? It sounds like only one executor is active and the rest are sitting idle.
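
On point 3: if the temp table is not cached, each query against it re-reads from the database. A minimal sketch of caching it, using the Spark 1.x API and the hypothetical table name from the sketch above:

    sqlContext.cacheTable("df_init")                       // marks the table for caching
    sqlContext.sql("SELECT count(*) FROM df_init").show()  // first action materializes the cache
    // equivalently, cache the DataFrame itself:
    df_init.cache()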

Re: Standalone cluster node utilization

2016-07-14 Thread Zhou (Joe) Xing
I have seen similar behavior in my standalone cluster. I tried to increase the number of partitions, and at some point it seems all the executors or worker nodes start to make parallel connections to the remote data store. But it would be nice if someone could point us to some references on how to ma…
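
As a sketch of the behavior being described: with no partitioning options, a JDBC read arrives as a single partition, so only one executor has work. Either the partitioning options shown earlier or an explicit repartition after the load spreads the later stages out; the count below is an assumed value:

    // the initial read is still one task; only downstream stages parallelize
    val spread = df_init.repartition(48)   // assumed: roughly 2-4x total executor cores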

Standalone cluster node utilization

2016-07-14 Thread Jakub Stransky
Hello, I have a Spark cluster running in standalone mode, master + 6 executors. My application reads data from a database via DataFrame.read, then filters rows. After that I re-partition the data, and I wonder why, on the executors page of the driver UI, I see RDD blocks all allocated…
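
A rough sketch of the pipeline this post describes (read, filter, repartition), with an assumed filter predicate, partition count, and table name, and connection options as in the sketches earlier in the thread:

    val df_init = sqlContext.read.format("jdbc")
      .options(jdbcOptions)                                        // hypothetical connection options
      .load()
    val filtered = df_init.filter(df_init("status") === "ACTIVE")  // assumed column and predicate
    val repartitioned = filtered.repartition(12)                   // assumed partition count
    repartitioned.registerTempTable("events_filtered")             // hypothetical table name

If the read itself is unpartitioned, the loaded data forms a single partition on the executor that performed the read, which can explain cached RDD blocks piling up on one node in the driver UI.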