Do you have a data model?

Modern technologies such as Hive, but also relational databases, suggest 
pre-joining tables and working on big flat tables. The reason is that they are 
distributed systems, and you want to avoid transferring a lot of data between 
nodes for each query.
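In practice that often means materializing the join result once and querying 
the flat table afterwards. A minimal sketch in HiveQL (table and column names 
here are made up for illustration):

CREATE TABLE sales_flat STORED AS ORC AS   -- pay the join cost once at load time
SELECT s.*, c.customer_name, c.region
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id;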
Hive supports the ORC format (and Parquet), which provides storage indexes and 
bloom filters. The idea is that these techniques let you skip reading a lot of 
data blocks. So the first optimization is using the right format. The second 
optimization is compression: snappy, for example, allows fast decompression, 
so it is well suited to frequently accessed data.
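For example, a table definition along these lines (names are placeholders) 
enables ORC with snappy compression and a bloom filter on the join key:

CREATE TABLE sales_orc (
  customer_id BIGINT,
  amount      DECIMAL(10,2),
  ts          TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"="SNAPPY",                   -- fast decompression
  "orc.bloom.filter.columns"="customer_id"   -- lets readers skip blocks
);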
The third optimization is partitions and buckets.
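A sketch of the DDL (again with made-up names): partition by a column you 
usually filter on, and bucket on the join key so matching buckets can be 
joined pairwise:

CREATE TABLE sales_bucketed (
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)            -- enables partition pruning
CLUSTERED BY (customer_id) INTO 64 BUCKETS   -- both join sides should bucket the same way
STORED AS ORC;

-- on Hive 1.x, set this before populating the table:
SET hive.enforce.bucketing=true;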
The fourth optimization is evaluating the different types of joins.
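For instance, Hive can turn a join into a map join when the smaller side fits 
in memory, and with bucketed (and sorted) tables it can do bucket map joins or 
sort-merge-bucket joins instead of a full shuffle. The threshold value below 
is only an illustration:

SET hive.auto.convert.join=true;                    -- map join for small tables
SET hive.auto.convert.join.noconditionaltask.size=256000000;  -- small-table limit in bytes
SET hive.optimize.bucketmapjoin=true;               -- exploit matching bucketing
SET hive.optimize.bucketmapjoin.sortedmerge=true;   -- SMB join for sorted buckets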
The fifth optimization can be in-memory technologies, such as the Ignite HDFS 
cache or LLAP. The sixth optimization is the execution engine; Tez currently 
supports the most optimizations in Hive.
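Switching the engine is a per-session setting, assuming Tez is deployed on 
your cluster; vectorized execution and the cost-based optimizer are usually 
worth enabling at the same time:

SET hive.execution.engine=tez;
SET hive.vectorized.execution.enabled=true;   -- process batches of rows instead of row by row
SET hive.cbo.enable=true;                     -- cost-based optimizer (Calcite)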
The seventh optimization is the data model: use int or similar numeric values 
for join keys, and in general try to use the right data type for your needs.
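For instance, if the natural key is a string, you might add a numeric 
surrogate key once and join on that. A hypothetical sketch:

CREATE TABLE customers_dim (
  customer_id  BIGINT,   -- numeric surrogate key: cheap to hash and compare
  customer_ref STRING    -- original business key, kept for reference
)
STORED AS ORC;

SELECT s.amount, d.customer_ref
FROM sales_bucketed s
JOIN customers_dim d ON s.customer_id = d.customer_id;  -- BIGINT = BIGINT join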
The eighth optimization can be hardware improvements. The ninth optimization 
can be using the right data structure: if your queries are more graph-type 
queries, consider a graph database, such as TitanDB. The tenth type of 
optimization is OS/network-level tuning, for example using jumbo frames for 
your network.
These are not necessarily in this order, and there is much more to optimize.

But as always this depends on your data structure. Sometimes joins can make 
sense. Most of the time you have to experiment.

> On 18 Jan 2016, at 09:07, Divya Gehlot <divya.htco...@gmail.com> wrote:
> 
> Hi,
> Need tips/guidance to optimize (increase performance of) billion-row joins 
> in Hive.
> 
> Any help would be appreciated.
> 
> 
> Thanks,
> Divya 
