That was it! Yeah, I agree; can't wait for someone to implement that API. Thank you very much, John :)
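For the archive, here is roughly what the working session looks like once the contrib jar is added (the jar path is a placeholder; point it at wherever your build puts hive_contrib.jar):

hive> add jar /path/to/hive_contrib.jar;
hive> create temporary function row_sequence as
    > 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
hive> set mapred.reduce.tasks=1;
hive> select uid from
    > (select uid
    > from cdata_dump
    > tablesample(bucket 1 out of 1000 on uid) s
    > order by uid
    > limit 1000) x
    > where (row_sequence() % 100000)=0
    > order by uid
    > limit 9;

Without the add jar, the reduce tasks can't instantiate UDFRowSequence, which is presumably what surfaced as the NullPointerException from ReflectionUtils.newInstance in the task logs below.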
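While we wait for that contribution, the rest of the flow from the wiki page, adapted to cdata_dump, should look roughly like the following. This is a sketch from memory of http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad; the table names (hb_range_keys, hbsort), paths under /tmp, and reducer count follow its example and are assumptions here, so double-check against the page:

hive> create external table hb_range_keys(uid string)
    > row format serde 'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
    > stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'
    > outputformat 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
    > location '/tmp/hb_range_keys';
hive> insert overwrite table hb_range_keys
    > select uid from ...;  -- the 9-key sampling query shown above
hive> dfs -cp /tmp/hb_range_keys/* /tmp/hb_range_key_list;
hive> create table hbsort(uid string, retail_cat_name1 string,
    > retail_cat_name2 string, retail_cat_name3 string, bread_crumb_csv string)
    > stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'
    > outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
    > tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');
hive> set mapred.reduce.tasks=10;  -- 9 split keys => 10 key ranges
hive> set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
hive> set total.order.partitioner.natural.order=false;
hive> set total.order.partitioner.path=/tmp/hb_range_key_list;
hive> insert overwrite table hbsort
    > select uid, retail_cat_name1, retail_cat_name2, retail_cat_name3, bread_crumb_csv
    > from cdata_dump
    > cluster by uid;

The HFiles under /tmp/hbsort/cf then get handed to HBase, which on 0.20.x means the loadtable.rb script; that hand-off is exactly the step the newer bulk load API (with its support for loading into existing tables) should eventually replace.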
-----Original Message-----
From: John Sichi [mailto:jsi...@fb.com]
Sent: Friday, February 04, 2011 12:12 PM
To: <user@hive.apache.org>
Subject: Re: Hive bulk load into HBase

I think I forgot to put

add jar /path/to/hive_contrib.jar;

in the instructions. Can you try that?

Also, some things may have changed since those instructions were written; I
recently had to update the way the corresponding unit test works.

Also, since then, HBase has added an API for bulk load (including support
for bulk loading into a table with existing data); a great Hive contribution
would be something which hooks that up and makes the whole thing smoother.

JVS

On Feb 4, 2011, at 11:52 AM, Brian Salazar wrote:

> I have been using the Bulk Load example here:
> http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad
>
> I am having an issue with a bulk load of 1 million records into HBase
> on a cluster of 6 nodes using Hive.
>
> Hive 0.6.0 (built from source to get UDFRowSequence)
> Hadoop 0.20.2
> HBase 0.20.6
> Zookeeper 3.3.2
>
> hive> desc cdata_dump;
> OK
> uid                 string
> retail_cat_name1    string
> retail_cat_name2    string
> retail_cat_name3    string
> bread_crumb_csv     string
> Time taken: 4.194 seconds
>
> Now my issue:
>
> hive> set mapred.reduce.tasks=1;
> hive> create temporary function row_sequence as
>     > 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
> OK
> Time taken: 0.0080 seconds
>
> hive> select uid from
>     > (select uid
>     > from cdata_dump
>     > tablesample(bucket 1 out of 1000 on uid) s
>     > order by uid
>     > limit 1000) x
>     > where (row_sequence() % 100000)=0
>     > order by uid
>     > limit 9;
> 11/02/04 19:25:21 INFO parse.ParseDriver: Parsing command: select uid from (select uid from cdata_dump tablesample(bucket 1 out of 1000 on uid) s order by uid limit 1000) x where (row_sequence() % 100000)=0 order by uid limit 9
> 11/02/04 19:25:21 INFO parse.ParseDriver: Parse Completed
> 11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
> 11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Completed phase 1 of Semantic Analysis
> 11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Get metadata for source tables
> 11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Get metadata for subqueries
> 11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Get metadata for source tables
> 11/02/04 19:25:21 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 11/02/04 19:25:21 INFO metastore.ObjectStore: ObjectStore, initialize called
> 11/02/04 19:25:22 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it cannot be resolved.
> 11/02/04 19:25:22 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it cannot be resolved.
> 11/02/04 19:25:22 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be resolved.
> 11/02/04 19:25:23 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 11/02/04 19:25:23 INFO metastore.ObjectStore: Initialized ObjectStore
> 11/02/04 19:25:24 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=cdata_dump
> 11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Get metadata for subqueries
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Get metadata for destination tables
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Get metadata for destination tables
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Completed getting MetaData in Semantic Analysis
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Need sample filter
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: hashfnExpr = class org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(Column[uid]()
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: andExpr = class org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(Column[uid](), Const int 2147483647()
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: modExpr = class org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(Column[uid](), Const int 2147483647(), Const int 1000()
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: numeratorExpr = Const int 0
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: equalsExpr = class org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPEqual(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(Column[uid](), Const int 2147483647(), Const int 1000(), Const int 0()
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for FS(11)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for LIM(10)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for OP(9)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for RS(8)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for SEL(7)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for FIL(6)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for LIM(5)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for OP(4)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for RS(3)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for SEL(2)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for FIL(1)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Pushdown Predicates of FIL For Alias : s
> 11/02/04 19:25:25 INFO ppd.OpProcFactory:   (((hash(uid) & 2147483647) % 1000) = 0)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for TS(0)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Pushdown Predicates of TS For Alias : s
> 11/02/04 19:25:25 INFO ppd.OpProcFactory:   (((hash(uid) & 2147483647) % 1000) = 0)
> 11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
> 11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
> 11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
> 11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
> 11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
> 11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Completed plan generation
> 11/02/04 19:25:25 INFO ql.Driver: Semantic Analysis Completed
> 11/02/04 19:25:25 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:uid, type:string, comment:null)], properties:null)
> 11/02/04 19:25:25 INFO ql.Driver: Starting command: select uid from (select uid from cdata_dump tablesample(bucket 1 out of 1000 on uid) s order by uid limit 1000) x where (row_sequence() % 100000)=0 order by uid limit 9
> Total MapReduce jobs = 2
> 11/02/04 19:25:25 INFO ql.Driver: Total MapReduce jobs = 2
> Launching Job 1 out of 2
> 11/02/04 19:25:26 INFO ql.Driver: Launching Job 1 out of 2
> Number of reduce tasks determined at compile time: 1
> 11/02/04 19:25:26 INFO exec.MapRedTask: Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
> 11/02/04 19:25:26 INFO exec.MapRedTask: In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> 11/02/04 19:25:26 INFO exec.MapRedTask: set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
> 11/02/04 19:25:26 INFO exec.MapRedTask: In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> 11/02/04 19:25:26 INFO exec.MapRedTask: set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
> 11/02/04 19:25:26 INFO exec.MapRedTask: In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
> 11/02/04 19:25:26 INFO exec.MapRedTask: set mapred.reduce.tasks=<number>
> 11/02/04 19:25:26 INFO exec.MapRedTask: Using org.apache.hadoop.hive.ql.io.HiveInputFormat
> 11/02/04 19:25:26 INFO exec.MapRedTask: adding libjars: file:///home/hadoop/hive/build/dist/lib/hive_hbase-handler.jar,file:///usr/local/hadoop-0.20.2/zookeeper-3.3.2/zookeeper-3.3.2.jar,file:///usr/local/hadoop-0.20.2/hbase-0.20.6/hbase-0.20.6.jar
> 11/02/04 19:25:26 INFO exec.MapRedTask: Processing alias x:s
> 11/02/04 19:25:26 INFO exec.MapRedTask: Adding input file hdfs://hadoop-1:54310/user/hive/warehouse/cdata_dump
> 11/02/04 19:25:26 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 11/02/04 19:25:26 INFO mapred.FileInputFormat: Total input paths to process : 1
> Starting Job = job_201102040059_0016, Tracking URL = http://Hadoop-1:50030/jobdetails.jsp?jobid=job_201102040059_0016
> 11/02/04 19:25:27 INFO exec.MapRedTask: Starting Job = job_201102040059_0016, Tracking URL = http://Hadoop-1:50030/jobdetails.jsp?jobid=job_201102040059_0016
> Kill Command = /usr/local/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop-1:54311 -kill job_201102040059_0016
> 11/02/04 19:25:27 INFO exec.MapRedTask: Kill Command = /usr/local/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop-1:54311 -kill job_201102040059_0016
> 2011-02-04 19:25:32,266 Stage-1 map = 0%, reduce = 0%
> 11/02/04 19:25:32 INFO exec.MapRedTask: 2011-02-04 19:25:32,266 Stage-1 map = 0%, reduce = 0%
> 2011-02-04 19:25:38,304 Stage-1 map = 100%, reduce = 0%
> 11/02/04 19:25:38 INFO exec.MapRedTask: 2011-02-04 19:25:38,304 Stage-1 map = 100%, reduce = 0%
> 2011-02-04 19:25:47,354 Stage-1 map = 100%, reduce = 33%
> 11/02/04 19:25:47 INFO exec.MapRedTask: 2011-02-04 19:25:47,354 Stage-1 map = 100%, reduce = 33%
> 2011-02-04 19:25:50,377 Stage-1 map = 100%, reduce = 0%
> 11/02/04 19:25:50 INFO exec.MapRedTask: 2011-02-04 19:25:50,377 Stage-1 map = 100%, reduce = 0%
> 2011-02-04 19:25:59,429 Stage-1 map = 100%, reduce = 33%
> 11/02/04 19:25:59 INFO exec.MapRedTask: 2011-02-04 19:25:59,429 Stage-1 map = 100%, reduce = 33%
> 2011-02-04 19:26:02,445 Stage-1 map = 100%, reduce = 0%
> 11/02/04 19:26:02 INFO exec.MapRedTask: 2011-02-04 19:26:02,445 Stage-1 map = 100%, reduce = 0%
> 2011-02-04 19:26:11,484 Stage-1 map = 100%, reduce = 33%
> 11/02/04 19:26:11 INFO exec.MapRedTask: 2011-02-04 19:26:11,484 Stage-1 map = 100%, reduce = 33%
> 2011-02-04 19:26:14,498 Stage-1 map = 100%, reduce = 0%
> 11/02/04 19:26:14 INFO exec.MapRedTask: 2011-02-04 19:26:14,498 Stage-1 map = 100%, reduce = 0%
> 2011-02-04 19:26:24,537 Stage-1 map = 100%, reduce = 33%
> 11/02/04 19:26:24 INFO exec.MapRedTask: 2011-02-04 19:26:24,537 Stage-1 map = 100%, reduce = 33%
> 2011-02-04 19:26:27,549 Stage-1 map = 100%, reduce = 0%
> 11/02/04 19:26:27 INFO exec.MapRedTask: 2011-02-04 19:26:27,549 Stage-1 map = 100%, reduce = 0%
> 2011-02-04 19:26:30,563 Stage-1 map = 100%, reduce = 100%
> 11/02/04 19:26:30 INFO exec.MapRedTask: 2011-02-04 19:26:30,563 Stage-1 map = 100%, reduce = 100%
> Ended Job = job_201102040059_0016 with errors
> 11/02/04 19:26:30 ERROR exec.MapRedTask: Ended Job = job_201102040059_0016 with errors
> FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
> 11/02/04 19:26:30 ERROR ql.Driver: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
>
>
> I am getting errors like this in the task log:
>
> 2011-02-04 19:25:44,460 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":""},"value":{"_col0":""},"alias":0}
>         at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:268)
>         at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":""},"value":{"_col0":""},"alias":0}
>         at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:256)
>         ... 3 more
> Caused by: java.lang.RuntimeException: java.lang.NullPointerException
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
>         at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.initialize(GenericUDFBridge.java:126)
>         at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:80)
>         at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:77)
>         at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:77)
>         at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:80)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
>         at org.apache.hadoop.hive.ql.exec.LimitOperator.processOp(LimitOperator.java:47)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
>         at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:247)
>         ... 3 more
> Caused by: java.lang.NullPointerException
>         at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:107)
>         ... 16 more
> 2011-02-04 19:25:44,463 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
>
> Any ideas?
>
> Thanks in advance!
>
> - Brian