Hi, Xuefu
     we plan to migrate selectively from Hive on MapReduce to Hive on Spark.
Because the hardware configuration of the compute nodes in our cluster is
uneven, we finally settled on the following configuration.

spark.dynamicAllocation.enabled     true
spark.shuffle.service.enabled       true
spark.dynamicAllocation.minExecutors        10
spark.rdd.compress              true

spark.executor.cores    2
spark.executor.memory   7000m
spark.yarn.executor.memoryOverhead      1024
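
     For the queries we route to Spark, these properties can also be set per
session from the Hive CLI or Beeline. A minimal sketch (the values just echo
the table above; they are not a recommendation):

  set hive.execution.engine=spark;
  set spark.executor.cores=2;
  set spark.executor.memory=7000m;
  set spark.yarn.executor.memoryOverhead=1024;
  set spark.dynamicAllocation.enabled=true;
  set spark.shuffle.service.enabled=true;
  set spark.dynamicAllocation.minExecutors=10;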

     We sample-tested dozens of production SQL queries, aiming to find out
which should run on MapReduce and which should run on Spark under our limited
resources.
     Our conclusions so far are the following.
     1. If the SQL does not contain a shuffle stage, use Hive on MapReduce;
for example a map join, or select * from table where ... (see the first
sketch after this list).
      2. SQL with many joins, such as select ... from table1 join table2
join table3, is highly suitable for Hive on Spark (second sketch below).
      3. As to multi-insert, Hive on Spark is much faster than Hive on
MapReduce (third sketch below).
      4. "Container killed by YARN for exceeding memory limits" can occur
with large jobs that shuffle more than 10 TB of data, so we do not advise
using Hive on Spark in that case.
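
     For tip 1, a quick sketch of the shuffle-free shapes we mean (table and
column names are made up for illustration):

  -- Map-only filter scan: no shuffle stage, so we keep it on MapReduce.
  set hive.execution.engine=mr;
  select * from orders where dt = '2016-01-01';

  -- Map join: the small table is broadcast to the mappers, still no shuffle.
  select /*+ MAPJOIN(dim_city) */ o.order_id, c.city_name
  from orders o join dim_city c on o.city_id = c.city_id;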
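
     For tip 2, the multi-join shape (names again made up). On Spark the
intermediate results flow between the join stages inside one job, instead of
being written to HDFS between chained MapReduce jobs:

  set hive.execution.engine=spark;
  select o.order_id, u.user_name, p.product_name
  from orders o
  join users u on o.user_id = u.user_id
  join products p on o.product_id = p.product_id;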
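
     For tip 3, a multi-insert sketch: the source table is scanned once and
the output fans out to several targets, which is where Spark gains the most
over a chain of MapReduce jobs (table names are illustrative):

  from orders o
  insert overwrite table orders_cn select o.* where o.country = 'CN'
  insert overwrite table orders_us select o.* where o.country = 'US';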

     Do you have more suggestions on when to use Hive on MapReduce or Hive
on Spark? Anyway, you are the author. ☺

      Best wishes!
      Thank you!
