Re: Why HBase integration with Hive makes Hive slow

Hao Ren Fri, 02 Aug 2013 08:07:21 -0700

Thank you, Lars.

Performance improved considerably after setting scanner caching to 10000,
but I still encounter a problem.


When loading data into an HBase table via Hive, I got a NullPointerException:

java.lang.NullPointerException
    at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector.get(WritableIntObjectInspector.java:35)
    at org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:199)
    at org.apache.hadoop.hive.hbase.HBaseSerDe.serialize(HBaseSerDe.java:696)
    at org.apache.hadoop.hive.hbase.HBaseSerDe.serialize(HBaseSerDe.java:758)
    at org.apache.hadoop.hive.hbase.HBaseSerDe.serialize(HBaseSerDe.java:713)
    at org.apache.hadoop.hive.hbase.HBaseSerDe.serialize(HBaseSerDe.java:758)
    at org.apache.hadoop.hive.hbase.HBaseSerDe.serialize(HBaseSerDe.java:713)
    at org.apache.hadoop.hive.hbase.HBaseSerDe.serialize(HBaseSerDe.java:685)
    at org.apache.hadoop.hive.hbase.HBaseSerDe.serializeField(HBaseSerDe.java:648)
    at org.apache.hadoop.hive.hbase.HBaseSerDe.serialize(HBaseSerDe.java:560)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:568)
    at shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:73)
    at shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:72)
    at scala.collection.Iterator$class.foreach(Iterator.scala:772)
    at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
    at shark.execution.FileSinkOperator.processPartition(FileSinkOperator.scala:72)
    at shark.execution.FileSinkOperator$.writeFiles$1(FileSinkOperator.scala:133)
    at shark.execution.FileSinkOperator$$anonfun$executeProcessFileSinkPartition$1.apply(FileSinkOperator.scala:138)
    at shark.execution.FileSinkOperator$$anonfun$executeProcessFileSinkPartition$1.apply(FileSinkOperator.scala:138)
    at spark.scheduler.ResultTask.run(ResultTask.scala:77)
    at spark.executor.Executor$TaskRunner.run(Executor.scala:98)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)

Here are the queries concerned:

CREATE TABLE hbase_byg_client (
idclient string,
isfictif boolean,
visites array<struct<idvisite:string, datevisite:string, isauthent:boolean, affichages:array<struct<page:string, idcategorie:int, freq:int>>>>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,navi:isfictif,navi:visites#s")
TBLPROPERTIES ("hbase.table.name" = "byg_client")
;

INSERT OVERWRITE TABLE hbase_byg_client
SELECT idClient, isfictif, collect_row(named_struct('idvisite', idVisite, 'dateVisite', dateVisit, 'isAuthent', isAuthent, 'affichages', t.affichages)) AS visites
FROM (
  SELECT idClient, isfictif, idVisite, dateVisit, isAuthent, collect_row(named_struct('page', page, 'IdCategorie', IdCategorie, 'freq', freq)) AS affichages
  FROM v_byg_clean
  GROUP BY idClient, isfictif, idVisite, dateVisit, isAuthent) t
GROUP BY idClient, isfictif
;

Note that hbase_byg_client contains a field of a complex (non-primitive) type.

Is there any workaround?

Thank you.

Hao

On 01/08/2013 21:00, lars hofhansl wrote:

You need to set scanner caching; otherwise each call to next() will be a network round trip (RTT).
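
To illustrate the point about round trips, here is a minimal sketch of the arithmetic in Python. It assumes a simplified model in which each scanner RPC returns at most `caching` rows and costs one network round trip; region boundaries, batching, and response size limits are ignored. The function names, the row count, and the RTT value are hypothetical and are not measurements from the cluster discussed in this thread.

```python
def scanner_round_trips(row_count: int, caching: int) -> int:
    """Estimate the number of scanner RPCs needed to read row_count rows.

    Simplified model (an assumption): every RPC returns up to `caching`
    rows, so the count is the ceiling of row_count / caching.
    """
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    if caching < 1:
        raise ValueError("caching must be at least 1")
    # Integer ceiling division avoids floating-point rounding.
    return (row_count + caching - 1) // caching


def estimated_network_seconds(row_count: int, caching: int, rtt_ms: float) -> float:
    """Time spent purely on round trips, in seconds, under the same model."""
    return scanner_round_trips(row_count, caching) * rtt_ms / 1000.0


if __name__ == "__main__":
    rows = 1_000_000   # hypothetical table size
    rtt_ms = 1.0       # hypothetical round-trip time in milliseconds
    for caching in (1, 100, 10_000):
        trips = scanner_round_trips(rows, caching)
        seconds = estimated_network_seconds(rows, caching, rtt_ms)
        print(f"caching={caching}: {trips} round trips, about {seconds} s on the network")
```

In the HBase client API this value is controlled per scan with Scan.setCaching(int) or through the hbase.client.scanner.caching property; how it is passed through Hive depends on the Hive version in use.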



________________________________
  From: Hao Ren <[email protected]>
To: [email protected]
Sent: Thursday, August 1, 2013 7:45 AM
Subject: Why HBase integration with Hive makes Hive slow

Hi,

I have a cluster (1 master + 3 slaves) running Hive, HBase, and
Hadoop.

In order to run a daily row-level update routine, we need to integrate
HBase with Hive, but the performance is not good.

For example, there are two tables in Hive:
      hbase_table: an HBase table created via Hive
      hive_table: a native Hive table
Both hold the same data set.

When running:
      select count(*) from hbase_table; ===> takes 500 s
      select count(*) from hive_table; ===> takes 6 s

I have tried a lot of queries on the two tables. But hbase_table is
always very slow.

To be clear, I created hbase_table as below:

CREATE TABLE hbase_table (
idvisite string,
client_list Array<string>,
nb_client int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,clients:id_list,clients:nb")
TBLPROPERTIES("hbase.table.name" = "table_test")
;

My HBase runs in pseudo-distributed mode.

I guess that, at the beginning of query execution, Hive loads the data
from HBase, and the SerDe takes a long time.

Could someone tell me how to improve this poor performance?
Is it caused by a misconfigured integration?
Is fully distributed mode needed here?

Thank you in advance for your time.

Hao.



--
Hao Ren
ClaraVista
www.claravista.fr
