Re: Hive on Spark knobs

2016-01-29 Thread Ruslan Dautkhanov
Yep, I tried that, and it seems you're right. I got an error saying that the
execution engine has to be set to mr:

hive.execution.engine = mr

I did not keep the exact error message/stack trace; the Spark execution engine is probably disabled explicitly in our build.
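
For reference, switching engines is normally just a per-session setting from the
Hive CLI / Beeline. A sketch only; on our build the spark value is apparently
rejected:

  -- switch this session to the Spark engine (this is what errored out for us)
  SET hive.execution.engine=spark;
  -- fall back to plain MapReduce
  SET hive.execution.engine=mr;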


-- 
Ruslan Dautkhanov

On Thu, Jan 28, 2016 at 7:03 AM, Todd  wrote:

> Did you run Hive on Spark with Spark 1.5 and Hive 1.1?
> I think Hive on Spark doesn't support Spark 1.5; there are compatibility
> issues.
>
>
> At 2016-01-28 01:51:43, "Ruslan Dautkhanov"  wrote:
>
>
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
>
> There are quite a lot of knobs to tune for Hive on Spark.
>
> The above page recommends the following settings:
>
>> mapreduce.input.fileinputformat.split.maxsize=750000000
>> hive.vectorized.execution.enabled=true
>> hive.cbo.enable=true
>> hive.optimize.reducededuplication.min.reducer=4
>> hive.optimize.reducededuplication=true
>> hive.orc.splits.include.file.footer=false
>> hive.merge.mapfiles=true
>> hive.merge.sparkfiles=false
>> hive.merge.smallfiles.avgsize=16000000
>> hive.merge.size.per.task=256000000
>> hive.merge.orcfile.stripe.level=true
>> hive.auto.convert.join=true
>> hive.auto.convert.join.noconditionaltask=true
>> hive.auto.convert.join.noconditionaltask.size=894435328
>> hive.optimize.bucketmapjoin.sortedmerge=false
>> hive.map.aggr.hash.percentmemory=0.5
>> hive.map.aggr=true
>> hive.optimize.sort.dynamic.partition=false
>> hive.stats.autogather=true
>> hive.stats.fetch.column.stats=true
>> hive.vectorized.execution.reduce.enabled=false
>> hive.vectorized.groupby.checkinterval=4096
>> hive.vectorized.groupby.flush.percent=0.1
>> hive.compute.query.using.stats=true
>> hive.limit.pushdown.memory.usage=0.4
>> hive.optimize.index.filter=true
>> hive.exec.reducers.bytes.per.reducer=67108864
>> hive.smbjoin.cache.rows=10000
>> hive.exec.orc.default.stripe.size=67108864
>> hive.fetch.task.conversion=more
>> hive.fetch.task.conversion.threshold=1073741824
>> hive.fetch.task.aggr=false
>> mapreduce.input.fileinputformat.list-status.num-threads=5
>> spark.kryo.referenceTracking=false
>>
>> spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch
>
>
> Did it work for anybody? It could take days, if not weeks, to tune all of
> these parameters for a specific job.
>
> We're on Spark 1.5 / Hive 1.1.
>
>
> PS. We have a job that we can't get working well as a plain Hive job (a
> 3-table full outer join with GROUP BY + collect_list), so we thought to use
> Hive on Spark instead. Spark should handle this much better.
>
>
> Ruslan
>
>
>


Hive on Spark knobs

2016-01-27 Thread Ruslan Dautkhanov
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

There are quite a lot of knobs to tune for Hive on Spark.

The above page recommends the following settings (one way to apply them per
session is sketched after the list):

> mapreduce.input.fileinputformat.split.maxsize=750000000
> hive.vectorized.execution.enabled=true
> hive.cbo.enable=true
> hive.optimize.reducededuplication.min.reducer=4
> hive.optimize.reducededuplication=true
> hive.orc.splits.include.file.footer=false
> hive.merge.mapfiles=true
> hive.merge.sparkfiles=false
> hive.merge.smallfiles.avgsize=16000000
> hive.merge.size.per.task=256000000
> hive.merge.orcfile.stripe.level=true
> hive.auto.convert.join=true
> hive.auto.convert.join.noconditionaltask=true
> hive.auto.convert.join.noconditionaltask.size=894435328
> hive.optimize.bucketmapjoin.sortedmerge=false
> hive.map.aggr.hash.percentmemory=0.5
> hive.map.aggr=true
> hive.optimize.sort.dynamic.partition=false
> hive.stats.autogather=true
> hive.stats.fetch.column.stats=true
> hive.vectorized.execution.reduce.enabled=false
> hive.vectorized.groupby.checkinterval=4096
> hive.vectorized.groupby.flush.percent=0.1
> hive.compute.query.using.stats=true
> hive.limit.pushdown.memory.usage=0.4
> hive.optimize.index.filter=true
> hive.exec.reducers.bytes.per.reducer=67108864
> hive.smbjoin.cache.rows=10000
> hive.exec.orc.default.stripe.size=67108864
> hive.fetch.task.conversion=more
> hive.fetch.task.conversion.threshold=1073741824
> hive.fetch.task.aggr=false
> mapreduce.input.fileinputformat.list-status.num-threads=5
> spark.kryo.referenceTracking=false
>
> spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch
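
For anyone trying these, one way is to apply them per session before the query
rather than cluster-wide in hive-site.xml. A rough sketch only; property names
are taken from the list above, and the values should be validated against your
Hive/Spark versions:

  -- run this session on the Spark engine
  SET hive.execution.engine=spark;
  -- a few of the recommended knobs, set per session
  SET hive.vectorized.execution.enabled=true;
  SET hive.cbo.enable=true;
  SET hive.auto.convert.join.noconditionaltask.size=894435328;
  SET hive.exec.reducers.bytes.per.reducer=67108864;
  -- spark.* properties should be picked up when Hive launches its Spark session
  SET spark.kryo.referenceTracking=false;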


Did it work for anybody? It could take days, if not weeks, to tune all of
these parameters for a specific job.

We're on Spark 1.5 / Hive 1.1.


PS. We have a job that we can't get working well as a plain Hive job (a
3-table full outer join with GROUP BY + collect_list), so we thought to use
Hive on Spark instead. Spark should handle this much better.
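
For context, the query has roughly this shape (a hypothetical sketch; table and
column names are made up, not our real schema):

  -- 3-way full outer join, grouped, with collect_list aggregations
  SELECT COALESCE(a.id, b.id, c.id) AS id,
         collect_list(a.val)        AS a_vals,
         collect_list(b.val)        AS b_vals,
         collect_list(c.val)        AS c_vals
  FROM t1 a
  FULL OUTER JOIN t2 b ON a.id = b.id
  FULL OUTER JOIN t3 c ON COALESCE(a.id, b.id) = c.id
  GROUP BY COALESCE(a.id, b.id, c.id);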


Ruslan