Hi Jone,

Thanks for trying Hive on Spark. I don't know about your cluster, so I
cannot comment too much on your configurations. We do have a "Getting
Started" guide [1] which you may refer to. (We are currently updating the
document.) Your executor size (cores/memory) seems rather small and does
not align well with our guide's recommendations.

To me, there is no reason to use MR unless you have encountered a
bug or problem, like #4 in your list. However, it would be great if you could
share more details on the problem. It could be just a matter of heap size,
for which you can increase your executor memory. (I do understand you have
some constraints on that.)
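For instance, a rough sketch of the relevant settings (the values below are illustrative assumptions, not recommendations; stay within your nodes' capacity):

```
spark.executor.memory                10g
spark.yarn.executor.memoryOverhead   2048
```

Larger executor heaps reduce the chance of the YARN container being killed for exceeding memory limits, at the cost of fewer executors per node.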

For #1, you probably need to increase one configuration,
hive.auto.convert.join.noconditionaltask.size, which is the threshold for
converting common join to map join based on statistics. Even though this
configuration is used for both Hive on MapReduce and Hive on Spark, it is
interpreted differently. There are two types of statistics about data size:
totalSize and rawDataSize. totalSize is approximately the data size on
disk, while rawDataSize is approximately the data size in memory. Hive on
MapReduce uses totalSize. When both are available, Hive on Spark will
choose rawDataSize. Because of possible compression and serialization,
there could be huge difference between totalSize and rawDataSize for the
same dataset. Thus, for Hive on Spark, you might need to specify a higher
value for the configuration in order to convert the same join to a map
join. Once a join is converted to a map join for Spark, better or similar
performance should be expected.
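As an illustrative sketch (the threshold value and table name below are assumptions for illustration only), the setting can be raised per session before running the query, and the statistics Hive compares against can be inspected per table:

```sql
-- Raise the map-join conversion threshold (example value: ~1 GB).
SET hive.auto.convert.join.noconditionaltask = true;
SET hive.auto.convert.join.noconditionaltask.size = 1073741824;

-- Inspect the statistics Hive will use (look for totalSize and
-- rawDataSize under "Table Parameters"); my_table is a placeholder.
DESCRIBE FORMATTED my_table;
```

Comparing rawDataSize to the threshold shows whether a given join will be converted to a map join under Hive on Spark.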

Hope this helps.

Thanks,
Xuefu

[1]
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/admin_hos_config.html

On Wed, Nov 4, 2015 at 12:23 AM, Jone Zhang <[email protected]> wrote:

> Hi, Xuefu
>      we plan to move from Hive on MapReduce to Hive on Spark selectively.
> Because the hardware configuration of the compute nodes in the cluster is
> uneven, we settled on the following configuration.
>
> spark.dynamicAllocation.enabled     true
> spark.shuffle.service.enabled       true
> spark.dynamicAllocation.minExecutors        10
> spark.rdd.compress              true
>
> spark.executor.cores    2
> spark.executor.memory   7000m
> spark.yarn.executor.memoryOverhead      1024
>
>      We sample-tested dozens of production SQL queries, expecting to find
> out which should run on MapReduce and which on Spark under our limited
> resources.
>      The following are our conclusions.
>      1. If the SQL does not contain a shuffle stage, use Hive on
> MapReduce, such as a map join or select * from table where ...
>       2. SQL with many joins, such as select ... from table1 join table2
> join table3, is highly suitable for Hive on Spark.
>       3. For multi-insert, Hive on Spark is much faster than Hive on
> MapReduce.
>       4. "Container killed by YARN for exceeding memory limits" can occur
> when shuffling large data (over 10 TB), so we do not advise using Hive on
> Spark in that case.
>
>      Do you have more suggestions on when to use Hive on MapReduce or Hive
> on Spark? Anyway, you are the writer. ☺
>
>       Best wishes!
>       Thank you!
>
