Hi Jone,

Thanks for trying Hive on Spark. I don't know much about your cluster, so I can't comment in detail on your configuration. We do have a "Getting Started" guide [1] which you may refer to. (We are currently updating the document.) Your executor size (cores/memory) seems rather small and does not align well with our guide.
To me, there is no reason to use MR unless you have encountered a potential bug or problem, like #4 in your list. However, it would be great if you could share more details on that problem. It could be just a matter of heap size, in which case you can increase your executor memory. (I do understand you have some constraints on that.)

For #1, you probably need to increase one configuration, hive.auto.convert.join.noconditionaltask.size, which is the threshold for converting a common join to a map join based on statistics. Even though this configuration is used for both Hive on MapReduce and Hive on Spark, it is interpreted differently. There are two types of statistics about data size: totalSize and rawDataSize. totalSize is approximately the data size on disk, while rawDataSize is approximately the data size in memory. Hive on MapReduce uses totalSize; when both are available, Hive on Spark chooses rawDataSize. Because of possible compression and serialization, there can be a huge difference between totalSize and rawDataSize for the same dataset. Thus, for Hive on Spark, you may need to specify a larger value for this configuration in order to convert the same join to a map join. Once a join is converted to a map join for Spark, similar or better performance should be expected.

Hope this helps.

Thanks,
Xuefu

[1] http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/admin_hos_config.html

On Wed, Nov 4, 2015 at 12:23 AM, Jone Zhang <[email protected]> wrote:
> Hi, Xuefu
> We plan to move from Hive on MapReduce to Hive on Spark selectively.
> Because the compute nodes in the cluster are heterogeneous, we chose the
> following configuration in the end.
>
> spark.dynamicAllocation.enabled true
> spark.shuffle.service.enabled true
> spark.dynamicAllocation.minExecutors 10
> spark.rdd.compress true
>
> spark.executor.cores 2
> spark.executor.memory 7000m
> spark.yarn.executor.memoryOverhead 1024
>
> We sample-tested dozens of production SQL queries, expecting to find out
> which should run on MapReduce and which should run on Spark under the
> limited resources. The following tips are our conclusions.
> 1. If the SQL does not contain a shuffle stage, use Hive on MapReduce,
> for example a map join or "select * from table where ...".
> 2. SQL that joins many tables, such as "select ... from table1 join
> table2 join table3", is highly suitable for Hive on Spark.
> 3. For multi-insert, Hive on Spark is much faster than Hive on
> MapReduce.
> 4. "Container killed by YARN for exceeding memory limits" can occur
> when processing large data that shuffles over 10T, so we don't advise
> using Hive on Spark for that.
>
> Do you have more suggestions on when to use Hive on MapReduce or Hive
> on Spark? Anyway, you are the writer. ☺
>
> Best wishes!
> Thank you!
>
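P.S. To make the #1 advice concrete, here is a rough sketch of how one might raise the map-join threshold per session and inspect the statistics Hive bases the decision on. The 500 MB value and the table name dim_table are purely illustrative, not a recommendation for your cluster:

```sql
-- Enable noconditional map-join conversion and set the size threshold (bytes).
-- For Hive on Spark this threshold is compared against rawDataSize (in-memory
-- size), so it typically needs to be larger than the value used for MR.
-- 524288000 (500 MB) is an illustrative value only.
SET hive.auto.convert.join.noconditionaltask = true;
SET hive.auto.convert.join.noconditionaltask.size = 524288000;

-- Inspect the statistics Hive has for the small table ("dim_table" is a
-- hypothetical name); look for totalSize and rawDataSize under Table Parameters.
DESCRIBE FORMATTED dim_table;

-- Recompute basic statistics if rawDataSize is missing or stale.
ANALYZE TABLE dim_table COMPUTE STATISTICS;
```

Comparing rawDataSize from DESCRIBE FORMATTED against the threshold should tell you whether a given join will be converted to a map join under Hive on Spark.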
