Do you have a data model? Modern technologies such as Hive, but also relational databases, suggest prejoining tables and working on big flat tables. The reason is that these are distributed systems, and you want to avoid transferring a lot of data between nodes for each query. Hive supports the ORC format (or Parquet) with storage indexes and bloom filters; the idea is that these techniques let you skip reading a lot of data blocks.

With that in mind, the optimizations are roughly (not necessarily in this order, and there is much more to optimize):

1. Use the right storage format (ORC or Parquet).
2. Use compression. For example, Snappy decompresses fast, so it is more suitable for current (frequently accessed) data.
3. Use partitions and buckets.
4. Evaluate the different types of joins.
5. Consider in-memory technologies, such as the Ignite HDFS cache or LLAP.
6. Pick the execution engine; Tez currently supports the most optimizations in Hive.
7. Improve the data model: use int or similar numeric values for join keys, and try to use the right data types for your needs.
8. Improve the hardware.
9. Use the right data structure: if your queries are mostly graph-type queries, then consider a graph database such as TitanDB.
10. Apply OS/network-level optimizations, for example jumbo frames for your network.

Rough sketches for some of these follow below; table names, column names, and sizes in them are made up for illustration.
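For the format, a minimal ORC sketch with a bloom filter on the join key:

  -- ORC keeps min/max storage indexes per stripe; a bloom filter
  -- on the join key lets Hive skip stripes that cannot match.
  CREATE TABLE sales_orc (
    customer_id BIGINT,
    amount DECIMAL(10,2),
    sale_date STRING
  )
  STORED AS ORC
  TBLPROPERTIES (
    'orc.create.index' = 'true',
    'orc.bloom.filter.columns' = 'customer_id'
  );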
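Compression is a table property in ORC; a sketch along the same lines:

  -- SNAPPY gives larger files than ZLIB but fast decompression,
  -- which suits data that is read often; ZLIB fits cold data better.
  CREATE TABLE sales_snappy (
    customer_id BIGINT,
    amount DECIMAL(10,2)
  )
  STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'SNAPPY');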
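Partitions prune whole directories at query time, and bucketing pre-distributes the rows on the join key; a sketch:

  -- A filter on sale_date only reads the matching partitions.
  -- Bucketing and sorting both join-side tables identically on
  -- customer_id enables bucket map joins and sort-merge-bucket joins.
  CREATE TABLE sales_part (
    customer_id BIGINT,
    amount DECIMAL(10,2)
  )
  PARTITIONED BY (sale_date STRING)
  CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 64 BUCKETS
  STORED AS ORC;

  -- In Hive 1.x, set this before inserting so the buckets are
  -- actually written out:
  SET hive.enforce.bucketing = true;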
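For the join types: joining a billion-row fact table against a small dimension table should become a map join (no shuffle at all), and two large tables bucketed and sorted on the join key can use a sort-merge-bucket join. A sketch of the relevant settings, with thresholds you would tune for your cluster:

  -- Convert joins against small tables into map joins automatically.
  SET hive.auto.convert.join = true;
  SET hive.mapjoin.smalltable.filesize = 25000000;  -- ~25 MB threshold

  -- For two large tables bucketed/sorted on the join key:
  SET hive.optimize.bucketmapjoin = true;
  SET hive.optimize.bucketmapjoin.sortedmerge = true;
  SET hive.auto.convert.sortmerge.join = true;

  SELECT s.customer_id, SUM(s.amount)
  FROM sales_part s
  JOIN customers c ON s.customer_id = c.customer_id
  GROUP BY s.customer_id;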
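The execution engine and LLAP are plain session settings, assuming Tez is installed and, for LLAP, that your Hive version ships it with the daemons already running:

  -- Tez replaces MapReduce with a DAG engine and currently gets
  -- the most optimizer attention in Hive.
  SET hive.execution.engine = tez;
  -- Process rows in batches instead of one at a time (needs ORC input
  -- in older Hive versions).
  SET hive.vectorized.execution.enabled = true;
  -- Push eligible work into the LLAP daemons, which cache hot
  -- ORC data in memory between queries.
  SET hive.llap.execution.mode = all;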
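On join keys and data types: comparing numeric values is cheaper than comparing strings, and they compress better, so for example:

  -- Prefer a numeric surrogate key assigned once at load time ...
  CREATE TABLE customers (customer_id BIGINT, name STRING) STORED AS ORC;
  -- ... over joining on a wide string key such as a UUID or an email
  -- address, where every row comparison in the join costs more.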
But as always this depends on your data structure. Sometimes joins can make sense. Most of the time you have to experiment.

> On 18 Jan 2016, at 09:07, Divya Gehlot <divya.htco...@gmail.com> wrote:
>
> Hi,
> Need tips/guidance to optimize(increase perfomance) billion data rows joins
> in hive .
>
> Any help would be appreciated.
>
>
> Thanks,
> Divya