I have been using the Bulk Load example here: http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad
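For reference, the part of that recipe I am following to get the row-sequence UDF registered looks roughly like this (the jar path below is only illustrative; mine comes out of my source build of Hive 0.6.0):

-- illustrative path to the contrib jar from my build; the class name is the contrib UDF from the wiki page
add jar /path/to/hive_contrib.jar;
create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';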
I am having an issue with a bulk load of 1 million records into HBase on a 6-node cluster using Hive.

Hive 0.6.0 (built from source to get UDFRowSequence)
Hadoop 0.20.2
HBase 0.20.6
ZooKeeper 3.3.2

hive> desc cdata_dump;
OK
uid                 string
retail_cat_name1    string
retail_cat_name2    string
retail_cat_name3    string
bread_crumb_csv     string
Time taken: 4.194 seconds

Now my issue:

hive> set mapred.reduce.tasks=1;
hive> create temporary function row_sequence as
    > 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
OK
Time taken: 0.0080 seconds
hive> select uid from
    > (select uid
    > from cdata_dump
    > tablesample(bucket 1 out of 1000 on uid) s
    > order by uid
    > limit 1000) x
    > where (row_sequence() % 100000)=0
    > order by uid
    > limit 9;
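For comparison, this is the stripped-down variant I plan to try next, with the row_sequence() filter removed, to check whether the tablesample/order/limit part runs on its own (it is just a simplification of the query above, not part of the wiki recipe):

-- same statement minus the outer row_sequence() filter; sanity check only
select uid
from (select uid
      from cdata_dump
      tablesample(bucket 1 out of 1000 on uid) s
      order by uid
      limit 1000) x
order by uid
limit 9;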
Here is the console output from the original query:

11/02/04 19:25:21 INFO parse.ParseDriver: Parsing command: select uid from (select uid from cdata_dump tablesample(bucket 1 out of 1000 on uid) s order by uid limit 1000) x where (row_sequence() % 100000)=0 order by uid limit 9
11/02/04 19:25:21 INFO parse.ParseDriver: Parse Completed
11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Completed phase 1 of Semantic Analysis
11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Get metadata for source tables
11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Get metadata for subqueries
11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Get metadata for source tables
11/02/04 19:25:21 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
11/02/04 19:25:21 INFO metastore.ObjectStore: ObjectStore, initialize called
11/02/04 19:25:22 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it cannot be resolved.
11/02/04 19:25:22 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it cannot be resolved.
11/02/04 19:25:22 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be resolved.
11/02/04 19:25:23 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
11/02/04 19:25:23 INFO metastore.ObjectStore: Initialized ObjectStore
11/02/04 19:25:24 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=cdata_dump
11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Get metadata for subqueries
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Get metadata for destination tables
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Get metadata for destination tables
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Completed getting MetaData in Semantic Analysis
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Need sample filter
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: hashfnExpr = class org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(Column[uid]()
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: andExpr = class org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(Column[uid](), Const int 2147483647()
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: modExpr = class org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(Column[uid](), Const int 2147483647(), Const int 1000()
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: numeratorExpr = Const int 0
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: equalsExpr = class org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPEqual(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(Column[uid](), Const int 2147483647(), Const int 1000(), Const int 0()
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for FS(11)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for LIM(10)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for OP(9)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for RS(8)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for SEL(7)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for FIL(6)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for LIM(5)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for OP(4)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for RS(3)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for SEL(2)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for FIL(1)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Pushdown Predicates of FIL For Alias : s
11/02/04 19:25:25 INFO ppd.OpProcFactory: (((hash(uid) & 2147483647) % 1000) = 0)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for TS(0)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Pushdown Predicates of TS For Alias : s
11/02/04 19:25:25 INFO ppd.OpProcFactory: (((hash(uid) & 2147483647) % 1000) = 0)
11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Completed plan generation
11/02/04 19:25:25 INFO ql.Driver: Semantic Analysis Completed
11/02/04 19:25:25 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:uid, type:string, comment:null)], properties:null)
11/02/04 19:25:25 INFO ql.Driver: Starting command: select uid from (select uid from cdata_dump tablesample(bucket 1 out of 1000 on uid) s order by uid limit 1000) x where (row_sequence() % 100000)=0 order by uid limit 9
Total MapReduce jobs = 2
11/02/04 19:25:25 INFO ql.Driver: Total MapReduce jobs = 2
Launching Job 1 out of 2
11/02/04 19:25:26 INFO ql.Driver: Launching Job 1 out of 2
Number of reduce tasks determined at compile time: 1
11/02/04 19:25:26 INFO exec.MapRedTask: Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
11/02/04 19:25:26 INFO exec.MapRedTask: In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
11/02/04 19:25:26 INFO exec.MapRedTask: set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
11/02/04 19:25:26 INFO exec.MapRedTask: In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
11/02/04 19:25:26 INFO exec.MapRedTask: set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
11/02/04 19:25:26 INFO exec.MapRedTask: In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
11/02/04 19:25:26 INFO exec.MapRedTask: set mapred.reduce.tasks=<number>
11/02/04 19:25:26 INFO exec.MapRedTask: Using org.apache.hadoop.hive.ql.io.HiveInputFormat
11/02/04 19:25:26 INFO exec.MapRedTask: adding libjars: file:///home/hadoop/hive/build/dist/lib/hive_hbase-handler.jar,file:///usr/local/hadoop-0.20.2/zookeeper-3.3.2/zookeeper-3.3.2.jar,file:///usr/local/hadoop-0.20.2/hbase-0.20.6/hbase-0.20.6.jar
11/02/04 19:25:26 INFO exec.MapRedTask: Processing alias x:s
11/02/04 19:25:26 INFO exec.MapRedTask: Adding input file hdfs://hadoop-1:54310/user/hive/warehouse/cdata_dump
11/02/04 19:25:26 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/02/04 19:25:26 INFO mapred.FileInputFormat: Total input paths to process : 1
Starting Job = job_201102040059_0016, Tracking URL = http://Hadoop-1:50030/jobdetails.jsp?jobid=job_201102040059_0016
11/02/04 19:25:27 INFO exec.MapRedTask: Starting Job = job_201102040059_0016, Tracking URL = http://Hadoop-1:50030/jobdetails.jsp?jobid=job_201102040059_0016
Kill Command = /usr/local/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop-1:54311 -kill job_201102040059_0016
11/02/04 19:25:27 INFO exec.MapRedTask: Kill Command = /usr/local/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop-1:54311 -kill job_201102040059_0016
2011-02-04 19:25:32,266 Stage-1 map = 0%, reduce = 0%
11/02/04 19:25:32 INFO exec.MapRedTask: 2011-02-04 19:25:32,266 Stage-1 map = 0%, reduce = 0%
2011-02-04 19:25:38,304 Stage-1 map = 100%, reduce = 0%
11/02/04 19:25:38 INFO exec.MapRedTask: 2011-02-04 19:25:38,304 Stage-1 map = 100%, reduce = 0%
2011-02-04 19:25:47,354 Stage-1 map = 100%, reduce = 33%
11/02/04 19:25:47 INFO exec.MapRedTask: 2011-02-04 19:25:47,354 Stage-1 map = 100%, reduce = 33%
2011-02-04 19:25:50,377 Stage-1 map = 100%, reduce = 0%
11/02/04 19:25:50 INFO exec.MapRedTask: 2011-02-04 19:25:50,377 Stage-1 map = 100%, reduce = 0%
2011-02-04 19:25:59,429 Stage-1 map = 100%, reduce = 33%
11/02/04 19:25:59 INFO exec.MapRedTask: 2011-02-04 19:25:59,429 Stage-1 map = 100%, reduce = 33%
2011-02-04 19:26:02,445 Stage-1 map = 100%, reduce = 0%
11/02/04 19:26:02 INFO exec.MapRedTask: 2011-02-04 19:26:02,445 Stage-1 map = 100%, reduce = 0%
2011-02-04 19:26:11,484 Stage-1 map = 100%, reduce = 33%
11/02/04 19:26:11 INFO exec.MapRedTask: 2011-02-04 19:26:11,484 Stage-1 map = 100%, reduce = 33%
2011-02-04 19:26:14,498 Stage-1 map = 100%, reduce = 0%
11/02/04 19:26:14 INFO exec.MapRedTask: 2011-02-04 19:26:14,498 Stage-1 map = 100%, reduce = 0%
2011-02-04 19:26:24,537 Stage-1 map = 100%, reduce = 33%
11/02/04 19:26:24 INFO exec.MapRedTask: 2011-02-04 19:26:24,537 Stage-1 map = 100%, reduce = 33%
2011-02-04 19:26:27,549 Stage-1 map = 100%, reduce = 0%
11/02/04 19:26:27 INFO exec.MapRedTask: 2011-02-04 19:26:27,549 Stage-1 map = 100%, reduce = 0%
2011-02-04 19:26:30,563 Stage-1 map = 100%, reduce = 100%
11/02/04 19:26:30 INFO exec.MapRedTask: 2011-02-04 19:26:30,563 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201102040059_0016 with errors
11/02/04 19:26:30 ERROR exec.MapRedTask: Ended Job = job_201102040059_0016 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
11/02/04 19:26:30 ERROR ql.Driver: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I am getting errors like this in the task log:

2011-02-04 19:25:44,460 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":""},"value":{"_col0":""},"alias":0}
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:268)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":""},"value":{"_col0":""},"alias":0}
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:256)
    ... 3 more
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.initialize(GenericUDFBridge.java:126)
    at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:80)
    at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:77)
    at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:77)
    at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:80)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
    at org.apache.hadoop.hive.ql.exec.LimitOperator.processOp(LimitOperator.java:47)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
    at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:247)
    ... 3 more
Caused by: java.lang.NullPointerException
    at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:107)
    ... 16 more
2011-02-04 19:25:44,463 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task

Any ideas? Thanks in advance!

- Brian