Hi.

On Spark standalone I think you cannot specify the number of worker
machines to use directly, but you can achieve the same effect this way:
https://stackoverflow.com/questions/39399205/spark-standalone-number-executors-cores-control

For example, if you want your jobs to run on all 10 machines using all of
their cores (10 executors, each one on a different machine and with 40
cores), you can use this configuration:

spark.cores.max       = 400
spark.executor.cores  = 40

If you want more executors with fewer cores each (let's say 20 executors,
each one with 20 cores):

spark.cores.max       = 400
spark.executor.cores  = 20

Note that in the last case each worker machine will run two executors.

In summary, use this trick:

number-of-executors = spark.cores.max / spark.executor.cores.

And keep in mind that the executors will be distributed among the available
workers.
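
For example, here is a minimal sketch of how those two properties could be
set on your SparkSession builder (the app name, master URL and memory value
are only placeholders; you can also pass the same properties with --conf on
spark-submit):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("MyApp")                        // placeholder app name
        .config("spark.cores.max", "400")        // total cores the job may use
        .config("spark.executor.cores", "40")    // cores per executor
        .config("spark.executor.memory", "8g")   // illustrative memory size
        .master("spark://master-host:7077")      // hypothetical master URL
        .getOrCreate();

With these values the division rule above gives 400 / 40 = 10 executors,
one per worker machine.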

Regards.


2018-04-11 21:39 GMT-05:00 宋源栋 <yuandong.s...@greatopensource.com>:

> Hi
>  1. Spark version : 2.3.0
>  2. jdk: oracle jdk 1.8
>  3. os version: centos 6.8
>  4. spark-env.sh: null
>  5. spark session config:
>
>
> SparkSession.builder().appName("DBScale")
>                 .config("spark.sql.crossJoin.enabled", "true")
>                 .config("spark.sql.adaptive.enabled", "true")
>                 .config("spark.scheduler.mode", "FAIR")
>                 .config("spark.executor.memory", "1g")
>                 .config("spark.executor.cores", 1)
>                 .config("spark.driver.memory", "20")
>                 .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>                 .config("spark.executor.extraJavaOptions",
>                         "-XX:+UseG1GC -XX:+PrintFlagsFinal 
> -XX:+PrintReferenceGC " +
>                                 "-verbose:gc -XX:+PrintGCDetails " +
>                                 "-XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy")
>                 .master(this.spark_master)
>                 .getOrCreate();
>
>   6. core code:
>
>
>          for (SparksqlTableInfo tableInfo: this.dbtable){ // this loop reads 
> data from mysql
>             String dt = "(" + tableInfo.sql + ")" + tableInfo.tmp_table_name;
>             String[] pred = new String[tableInfo.partition_num];
>             if (tableInfo.partition_num > 0) {
>                 for (int j = 0; j < tableInfo.partition_num; j++) {
>                     String str = "some where clause to split mysql table into 
> many partitions";
>                     pred[j] = str;
>                 }
>                 Dataset<Row> jdbcDF = ss.read().jdbc(this.url, dt, pred, 
> connProp); //this.url is mysql-jdbc-url (mysql://XX.XX.XX.XX:XXXX)
>                 jdbcDF.createOrReplaceTempView(tableInfo.tmp_table_name);
>             } else {
>                 logger.warn("[\033[32m" + "partition_num == 0" + "\033[0m]");
>                 Dataset<Row> jdbcDF = ss.read().jdbc(this.url, dt, connProp);
>                 jdbcDF.createOrReplaceTempView(tableInfo.tmp_table_name);
>             }
>         }
>
>
>         // Then run a query and write the result set to mysql
>
>         Dataset<Row> result = ss.sql(this.sql);
>         result.explain(true);
>         connProp.put("rewriteBatchedStatements", "true");
>         connProp.put("sessionVariables", "sql_log_bin=off");
>         result.write().jdbc(this.dst_url, this.dst_table, connProp);
>
>
>
> ------------------------------------------------------------------
> From: Jhon Anderson Cardenas Diaz <jhonderson2...@gmail.com>
> Sent: Wednesday, April 11, 2018, 22:42
> To: 宋源栋 <yuandong.s...@greatopensource.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: Spark is only using one worker machine when more are available
>
> Hi, could you please share more details, such as the environment variable
> values you are setting when you run the jobs, the Spark version, etc.
> Btw, you should take a look at SPARK_WORKER_INSTANCES and
> SPARK_WORKER_CORES if you are using Spark 2.0.0
> <https://spark.apache.org/docs/preview/spark-standalone.html>.
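>
> For example, a minimal sketch of what spark-env.sh might contain on each
> worker machine (values here are only illustrative, assuming one worker
> instance per machine exposing all of its 40 cores):
>
> SPARK_WORKER_INSTANCES=1   # number of worker daemons per machine
> SPARK_WORKER_CORES=40      # cores each worker offers to executors
> SPARK_WORKER_MEMORY=100g   # memory each worker offers to executors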
>
> Regards.
>
> 2018-04-11 4:10 GMT-05:00 宋源栋 <yuandong.s...@greatopensource.com>:
>
>
> Hi all,
>
> I have a standalone-mode Spark cluster without HDFS, with 10 machines,
> each of which has 40 CPU cores and 128 GB of RAM.
>
> My application is a Spark SQL application that reads data from the
> database "tpch_100g" in MySQL and runs TPC-H queries. When loading tables
> from MySQL into Spark, I split the biggest table "lineitem" into 600
> partitions.
>
> When my application runs, there are only 40 executors
> (spark.executor.memory = 1g, spark.executor.cores = 1) on the executors
> page of the Spark web UI, and all of them are on the same machine. It is
> too slow because all tasks run in parallel on only one machine.
>
