No.

First, I apologize for my first response.  I guess it's never a good idea to 
check email at 4:00 in the morning before your first cup of coffee. ;-)
I went into a bit more detail than necessary, which may have confused the issue.

To answer your question ("In other words, is querying over plain Hive (ORC or 
Text) always faster than through HiveStorageHandler?"):
No.

Not always.  It will depend on the data, the schema and the query.

HBase is a <KEY, VALUE> store, where the KEY is the rowkey.
HBase partitions its data based on the rowkey.

The rows are stored in lexicographically sorted order by rowkey. This 
effectively creates a physical index on the rowkey.
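A quick illustration of why that lexicographic ordering matters when you design rowkeys (plain Python standing in for HBase's byte-wise sort; the keys are made up):

```python
# HBase sorts rowkeys as raw bytes, i.e. lexicographically.
# Numeric keys must be zero-padded (or binary-encoded), or the
# stored order will not match numeric order and range scans will
# pick up the wrong rows.
unpadded = ["1", "10", "2", "21", "3"]
padded = [k.zfill(4) for k in unpadded]  # "0001", "0010", ...

print(sorted(unpadded))  # ['1', '10', '2', '21', '3'] -- "10" sorts before "2"
print(sorted(padded))    # ['0001', '0002', '0003', '0010', '0021']
```

The same concern applies to composite keys: the leftmost component dominates the sort, so put the field you most often range-scan on first.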

With respect to Hive, if the query filters on the rowkey, the resulting scan 
against HBase will be a range scan rather than a full table scan.

So…
SELECT *
FROM   someTable
WHERE  rowkey > aValue
AND    rowkey < bValue

This query will result in a range scan, and depending on the values of 
aValue and bValue, it can exclude a large portion of your table.
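For reference, the Hive side of this is an external table mapped through the storage handler. A sketch of the DDL (table and column names here are made up; the syntax follows the standard Hive/HBase integration):

```sql
CREATE EXTERNAL TABLE someTable (rowkey STRING, val STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val")
TBLPROPERTIES ("hbase.table.name" = "someTable");
```

The `:key` entry binds the first Hive column to the HBase rowkey, which is what lets a WHERE clause on that column become a range scan.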

If you were to store the same data in an HDFS table, the odds are your 
partitioning plan would not be on the rowkey, so your Hive query against that 
table would not be able to exclude data. Depending on how much data the range 
scan excludes, the HDFS query could be slower than the query against HBase.

There are some performance tuning tips that would help reduce the cost of the 
query.


Does this make sense?

Even though HBase could be slower, there are some reasons why you may want to 
use HBase over ORC or Parquet.

The problem is that there isn't a straightforward, simple answer. There are 
a lot of factors that go into deciding which tools to use.
(e.g. If you’re a Cloudera Fanboi, you’d use Impala as your query engine, 
Hortonworks would push Tez, and MapR is a bit more agnostic.)

And then you have to decide if you want to use Hive, or a query engine or 
something different altogether.

====================

HAVING SAID THAT…
Depending on your query, you may want to consider secondary indexing.

When you do that, HBase can become faster. Again, it depends on the query and 
the data.

HBase takes care of deduping data; Hive does not. So unless your data set is 
immutable (new rows = new data, no updates), you will have to figure out how 
to dedupe your data, which is outside of Hive.
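To make the dedup point concrete, here is a toy sketch (plain Python, made-up data) of what HBase effectively does for you on read, serving only the newest version per rowkey; with plain Hive files you would have to reimplement this pass yourself (e.g. a ROW_NUMBER() query or a compaction job):

```python
# Each write is (rowkey, timestamp, value). HBase keeps cell versions
# and serves the newest one per rowkey automatically.
writes = [
    ("acct-001", 100, "open"),
    ("acct-002", 100, "open"),
    ("acct-001", 200, "closed"),  # an update: same rowkey, newer timestamp
]

# The "keep only the latest version per key" pass.
latest = {}
for rowkey, ts, value in writes:
    if rowkey not in latest or ts > latest[rowkey][0]:
        latest[rowkey] = (ts, value)

print({k: v[1] for k, v in latest.items()})
# {'acct-001': 'closed', 'acct-002': 'open'}
```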

Note that this falls outside the scope of this discussion. The OP’s question is 
about a base table and doesn’t seem to involve any joins.

HTH

-Mike




On Jun 9, 2017, at 4:50 AM, Amey Barve <ameybarv...@gmail.com> wrote:

Hi Michael,

"If there is predicate pushdown, then you will be faster, assuming that the 
query triggers an implied range scan"
---> Does this bring results faster than plain Hive querying over ORC / Text 
file formats?

In other words, is querying over plain Hive (ORC or Text) always faster than 
through HiveStorageHandler?

Regards,
Amey

On 9 June 2017 at 15:08, Michael Segel <msegel_had...@hotmail.com> wrote:
The pros: you have the ability to update a table without having to worry about 
duplication of the row.  Tez does some form of compaction for you that already 
exists natively in HBase.

The cons:

1) It's slower. Reads from HBase carry more overhead than just reading a file.  
Read Lars George's book on what takes place when you do a read.

2) HBase is not a relational store. (You have to think about what that implies)

3) You need to query against your row key for best performance, otherwise it 
will always be a complete table scan.

HBase was designed to give you fast access for direct get() calls and limited 
range scans.  Otherwise you have to perform full table scans.  This means that 
unless you're able to do a range scan, your full table scan will be slower than 
the same scan over a flat file set.  Again, this is the reason you would want 
to use HBase: when your data set is mutable.

You also have to trigger a range scan when you write your Hive query, and you 
have to make sure that you're querying off your row key.

HBase was designed as a <key,value> store, plain and simple.  If you don't use 
the key, you have to do a full table scan. So even though HBase is partitioning 
on the row key, you never use those partitions.  However, in Hive or Spark you 
can create an alternative partition pattern.  (e.g. your key is the 
transaction_id, yet you partition on the month/year portion of the 
transaction_date)
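A toy sketch of that alternative partition pattern (hypothetical field names, plain Python): the rowkey stays the transaction_id, but the partition value is derived from the transaction_date:

```python
from datetime import date

def partition_for(txn):
    # The rowkey would be txn["transaction_id"]; the Hive/Spark
    # partition is derived from the transaction_date instead,
    # in the usual year=/month= directory layout.
    d = txn["transaction_date"]
    return f"year={d.year}/month={d.month:02d}"

txn = {"transaction_id": "tx-123", "transaction_date": date(2017, 6, 9)}
print(partition_for(txn))  # year=2017/month=06
```

Queries that filter on the date then prune partitions, even though the HBase-side key is the transaction_id.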

You can speed things up a little by using an inverted table as a secondary 
index. However this assumes that you want to use joins. If you have a single 
base table with no joins then you can limit your range scans based on making 
sure you are querying against the row key.  Note: This will mean that you have 
limited querying capabilities.
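A minimal sketch of the inverted-table idea (plain Python dicts standing in for two HBase tables; all names are made up): the index table maps a column value back to the rowkeys that contain it, so a query on that column becomes a lookup plus a handful of gets instead of a full scan:

```python
# Base table: rowkey -> row.
base = {
    "tx-1": {"customer": "acme", "amount": 10},
    "tx-2": {"customer": "zenith", "amount": 25},
    "tx-3": {"customer": "acme", "amount": 40},
}

# Inverted index on "customer": value -> set of rowkeys. In HBase this
# would be a second table you maintain on every write to the base table.
index = {}
for rowkey, row in base.items():
    index.setdefault(row["customer"], set()).add(rowkey)

# "All rows for customer acme" without scanning the base table:
hits = [base[k] for k in sorted(index.get("acme", set()))]
print(hits)
# [{'customer': 'acme', 'amount': 10}, {'customer': 'acme', 'amount': 40}]
```

The trade-off is exactly the one described above: every write now touches two tables, and the index only helps queries on the indexed column.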

And yes, I’ve done this before but can’t share it with you.

HTH

P.S.
I haven't tried Hive queries that would be the equivalent of a get().

In earlier versions of Hive, the issue was that "SELECT * FROM foo WHERE 
rowkey=BAR" would still do a full table scan because of the lack of predicate 
pushdown.
This may have been fixed in later releases of Hive; that would be your test 
case.  If there is predicate pushdown, then you will be faster, assuming that 
the query triggers an implied range scan.
This would be a simple thing to test. However, keep in mind that you're going 
to generate a map/reduce job (unless you're using a query engine like Tez), 
where you wouldn't if you just wrote your code in Java.




> On Jun 7, 2017, at 5:13 AM, Ramasubramanian Narayanan
> <ramasubramanian.naraya...@gmail.com> wrote:
>
> Hi,
>
> Can you please let us know the pros and cons of using an HBase table as an 
> external table in Hive.
>
> Will there be any performance degradation when using Hive over HBase instead 
> of using a direct Hive table?
>
> The table that I am planning to use in HBase will be a master table like 
> account or customer. I want to achieve Slowly Changing Dimensions. Please 
> throw some light on that too if you have done any such implementations.
>
> Thanks and Regards,
> Rams


