Yes, you should use ORC; it is much faster and more compact. Additionally, you can apply compression (Snappy) to increase performance. Your data processing pipeline seems to be not very optimized. You should use the newest Hive version, enabling storage indexes and bloom filters on appropriate columns.
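The advice above could look roughly like the following Hive DDL sketch. The table and column names are made up for illustration; the `orc.*` table properties are the standard knobs for Snappy compression, ORC min/max storage indexes, and per-column bloom filters.

```sql
-- Hypothetical table; daily_events and its columns are assumed names.
CREATE TABLE daily_events (
  event_id   INT,
  user_id    INT,
  amount     DECIMAL(10,2),
  event_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'SNAPPY',               -- block compression
  'orc.create.index' = 'true',             -- min/max storage indexes
  'orc.bloom.filter.columns' = 'user_id'   -- bloom filter on a join/filter column
);
```

Bloom filters pay off on columns that appear in equality predicates or join keys, since they let ORC skip stripes that cannot contain the value.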
Hi, I have two things I need to know.
FIRST:
In our project we use Hive.
We get new data daily. We need to process this new data only once and send
the processed data to an RDBMS. In the processing we mainly use many
complex queries with joins, WHERE conditions, and grouping functions.
There are
Additionally, it is of key importance to use the right data types for the
columns: use int for IDs, and int, decimal, float, double, etc. for
numeric values. A bad data model using varchar and string where not
appropriate is a significant bottleneck.
Furthermore, include partition columns
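Taken together, the data-type and partitioning advice might look like this sketch (table and column names are assumptions, not from the thread). Partitioning by the daily load date matches the "new data every day" pattern described above:

```sql
-- Hypothetical table; names are for illustration only.
CREATE TABLE transactions (
  txn_id     INT,            -- int for IDs, not string
  account_id INT,
  amount     DECIMAL(12,2),  -- decimal for numeric values, not varchar
  status     VARCHAR(16)     -- varchar only where text is genuinely needed
)
PARTITIONED BY (load_date DATE)  -- one partition per daily load
STORED AS ORC;
```

Queries that filter on the partition column, e.g. `WHERE load_date = '2015-08-06'`, then read only that day's partition instead of scanning the whole table.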
I'm really sorry; by mistake I posted to the Spark mailing list.
Jörn Franke, thanks for your reply.
I have many joins and many complex queries, and all are full table scans, so
I think HBase will not work for me.
On Thursday, August 6, 2015, Jörn Franke jornfra...@gmail.com wrote: