That was it! Yeah, I agree; can't wait for someone to implement that API. Thank you very much, John :)
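For the archive, here is roughly what the working session looks like once the contrib jar is added (the jar path is a placeholder; point it at wherever your build puts hive_contrib.jar):

hive> add jar /path/to/hive_contrib.jar;
hive> create temporary function row_sequence as
    > 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
hive> set mapred.reduce.tasks=1;
hive> select uid from
    > (select uid
    > from cdata_dump
    > tablesample(bucket 1 out of 1000 on uid) s
    > order by uid
    > limit 1000) x
    > where (row_sequence() % 100000)=0
    > order by uid
    > limit 9;

Without the add jar, the reduce tasks can't instantiate UDFRowSequence, which is presumably what surfaced as the NullPointerException from ReflectionUtils.newInstance in the task logs below.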
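While we wait for that contribution, the rest of the flow from the wiki page, adapted to cdata_dump, should look roughly like the following. This is a sketch from memory of http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad; the table names (hb_range_keys, hbsort), paths under /tmp, and reducer count follow its example and are assumptions here, so double-check against the page:

hive> create external table hb_range_keys(uid string)
    > row format serde 'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
    > stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'
    > outputformat 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
    > location '/tmp/hb_range_keys';
hive> insert overwrite table hb_range_keys
    > select uid from ...;  -- the 9-key sampling query shown above
hive> dfs -cp /tmp/hb_range_keys/* /tmp/hb_range_key_list;
hive> create table hbsort(uid string, retail_cat_name1 string,
    > retail_cat_name2 string, retail_cat_name3 string, bread_crumb_csv string)
    > stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'
    > outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
    > tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');
hive> set mapred.reduce.tasks=10;  -- 9 split keys => 10 key ranges
hive> set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
hive> set total.order.partitioner.natural.order=false;
hive> set total.order.partitioner.path=/tmp/hb_range_key_list;
hive> insert overwrite table hbsort
    > select uid, retail_cat_name1, retail_cat_name2, retail_cat_name3, bread_crumb_csv
    > from cdata_dump
    > cluster by uid;

The HFiles under /tmp/hbsort/cf then get handed to HBase, which on 0.20.x means the loadtable.rb script; that hand-off is exactly the step the newer bulk load API (with its support for loading into existing tables) should eventually replace.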
-----Original Message-----
From: John Sichi [mailto:jsi...@fb.com]
Sent: Friday, February 04, 2011 12:12 PM
To: <user@hive.apache.org>
Subject: Re: Hive bulk load into HBase

I think I forgot to put

add jar /path/to/hive_contrib.jar;

in the instructions. Can you try that?

Also, some things may have changed since those instructions were written; I
recently had to update the way the corresponding unit test works.

Also, since then, HBase has added an API for bulk load (including support
for bulk loading into a table with existing data); a great Hive contribution
would be something which hooks that up and makes the whole thing smoother.

JVS

On Feb 4, 2011, at 11:52 AM, Brian Salazar wrote:

> I have been using the Bulk Load example here:
> http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad
>
> I am having an issue with a bulk load of 1 million records into HBase
> on a cluster of 6 nodes using Hive.
>
> Hive 0.6.0 (built from source to get UDFRowSequence)
> Hadoop 0.20.2
> HBase 0.20.6
> Zookeeper 3.3.2
>
> hive> desc cdata_dump;
> OK
> uid                 string
> retail_cat_name1    string
> retail_cat_name2    string
> retail_cat_name3    string
> bread_crumb_csv     string
> Time taken: 4.194 seconds
>
> Now my issue:
>
> hive> set mapred.reduce.tasks=1;
> hive> create temporary function row_sequence as
>     > 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
> OK
> Time taken: 0.0080 seconds
>
> hive> select uid from
>     > (select uid
>     > from cdata_dump
>     > tablesample(bucket 1 out of 1000 on uid) s
>     > order by uid
>     > limit 1000) x
>     > where (row_sequence() % 100000)=0
>     > order by uid
>     > limit 9;
> 11/02/04 19:25:21 INFO parse.ParseDriver: Parsing command: select uid from (select uid from cdata_dump tablesample(bucket 1 out of 1000 on uid) s order by uid limit 1000) x where (row_sequence() % 100000)=0 order by uid limit 9
> 11/02/04 19:25:21 INFO parse.ParseDriver: Parse Completed
> 11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
> 11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Completed phase 1 of Semantic Analysis
> 11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Get metadata for source tables
> 11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Get metadata for subqueries
> 11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Get metadata for source tables
> 11/02/04 19:25:21 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 11/02/04 19:25:21 INFO metastore.ObjectStore: ObjectStore, initialize called
> 11/02/04 19:25:22 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it cannot be resolved.
> 11/02/04 19:25:22 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it cannot be resolved.
> 11/02/04 19:25:22 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be resolved.
> 11/02/04 19:25:23 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 11/02/04 19:25:23 INFO metastore.ObjectStore: Initialized ObjectStore
> 11/02/04 19:25:24 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=cdata_dump
> 11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Get metadata for subqueries
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Get metadata for destination tables
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Get metadata for destination tables
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Completed getting MetaData in Semantic Analysis
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Need sample filter
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: hashfnExpr = class org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(Column[uid]()
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: andExpr = class org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(Column[uid](), Const int 2147483647()
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: modExpr = class org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(Column[uid](), Const int 2147483647(), Const int 1000()
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: numeratorExpr = Const int 0
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: equalsExpr = class org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPEqual(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(Column[uid](), Const int 2147483647(), Const int 1000(), Const int 0()
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for FS(11)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for LIM(10)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for OP(9)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for RS(8)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for SEL(7)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for FIL(6)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for LIM(5)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for OP(4)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for RS(3)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for SEL(2)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for FIL(1)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Pushdown Predicates of FIL For Alias : s
> 11/02/04 19:25:25 INFO ppd.OpProcFactory:   (((hash(uid) & 2147483647) % 1000) = 0)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for TS(0)
> 11/02/04 19:25:25 INFO ppd.OpProcFactory: Pushdown Predicates of TS For Alias : s
> 11/02/04 19:25:25 INFO ppd.OpProcFactory:   (((hash(uid) & 2147483647) % 1000) = 0)
> 11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
> 11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
> 11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
> 11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
> 11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
> 11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string bread_crumb_csv}
> 11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Completed plan generation
> 11/02/04 19:25:25 INFO ql.Driver: Semantic Analysis Completed
> 11/02/04 19:25:25 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:uid, type:string, comment:null)], properties:null)
> 11/02/04 19:25:25 INFO ql.Driver: Starting command: select uid from (select uid from cdata_dump tablesample(bucket 1 out of 1000 on uid) s order by uid limit 1000) x where (row_sequence() % 100000)=0 order by uid limit 9
> Total MapReduce jobs = 2
> 11/02/04 19:25:25 INFO ql.Driver: Total MapReduce jobs = 2
> Launching Job 1 out of 2
> 11/02/04 19:25:26 INFO ql.Driver: Launching Job 1 out of 2
> Number of reduce tasks determined at compile time: 1
> 11/02/04 19:25:26 INFO exec.MapRedTask: Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
> 11/02/04 19:25:26 INFO exec.MapRedTask: In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> 11/02/04 19:25:26 INFO exec.MapRedTask: set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
> 11/02/04 19:25:26 INFO exec.MapRedTask: In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> 11/02/04 19:25:26 INFO exec.MapRedTask: set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
> 11/02/04 19:25:26 INFO exec.MapRedTask: In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
> 11/02/04 19:25:26 INFO exec.MapRedTask: set mapred.reduce.tasks=<number>
> 11/02/04 19:25:26 INFO exec.MapRedTask: Using org.apache.hadoop.hive.ql.io.HiveInputFormat
> 11/02/04 19:25:26 INFO exec.MapRedTask: adding libjars: file:///home/hadoop/hive/build/dist/lib/hive_hbase-handler.jar,file:///usr/local/hadoop-0.20.2/zookeeper-3.3.2/zookeeper-3.3.2.jar,file:///usr/local/hadoop-0.20.2/hbase-0.20.6/hbase-0.20.6.jar
> 11/02/04 19:25:26 INFO exec.MapRedTask: Processing alias x:s
> 11/02/04 19:25:26 INFO exec.MapRedTask: Adding input file hdfs://hadoop-1:54310/user/hive/warehouse/cdata_dump
> 11/02/04 19:25:26 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 11/02/04 19:25:26 INFO mapred.FileInputFormat: Total input paths to process : 1
> Starting Job = job_201102040059_0016, Tracking URL = http://Hadoop-1:50030/jobdetails.jsp?jobid=job_201102040059_0016
> 11/02/04 19:25:27 INFO exec.MapRedTask: Starting Job = job_201102040059_0016, Tracking URL = http://Hadoop-1:50030/jobdetails.jsp?jobid=job_201102040059_0016
> Kill Command = /usr/local/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop-1:54311 -kill job_201102040059_0016
> 11/02/04 19:25:27 INFO exec.MapRedTask: Kill Command = /usr/local/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop-1:54311 -kill job_201102040059_0016
> 2011-02-04 19:25:32,266 Stage-1 map = 0%, reduce = 0%
> 11/02/04 19:25:32 INFO exec.MapRedTask: 2011-02-04 19:25:32,266 Stage-1 map = 0%, reduce = 0%
> 2011-02-04 19:25:38,304 Stage-1 map = 100%, reduce = 0%
> 11/02/04 19:25:38 INFO exec.MapRedTask: 2011-02-04 19:25:38,304 Stage-1 map = 100%, reduce = 0%
> 2011-02-04 19:25:47,354 Stage-1 map = 100%, reduce = 33%
> 11/02/04 19:25:47 INFO exec.MapRedTask: 2011-02-04 19:25:47,354 Stage-1 map = 100%, reduce = 33%
> 2011-02-04 19:25:50,377 Stage-1 map = 100%, reduce = 0%
> 11/02/04 19:25:50 INFO exec.MapRedTask: 2011-02-04 19:25:50,377 Stage-1 map = 100%, reduce = 0%
> 2011-02-04 19:25:59,429 Stage-1 map = 100%, reduce = 33%
> 11/02/04 19:25:59 INFO exec.MapRedTask: 2011-02-04 19:25:59,429 Stage-1 map = 100%, reduce = 33%
> 2011-02-04 19:26:02,445 Stage-1 map = 100%, reduce = 0%
> 11/02/04 19:26:02 INFO exec.MapRedTask: 2011-02-04 19:26:02,445 Stage-1 map = 100%, reduce = 0%
> 2011-02-04 19:26:11,484 Stage-1 map = 100%, reduce = 33%
> 11/02/04 19:26:11 INFO exec.MapRedTask: 2011-02-04 19:26:11,484 Stage-1 map = 100%, reduce = 33%
> 2011-02-04 19:26:14,498 Stage-1 map = 100%, reduce = 0%
> 11/02/04 19:26:14 INFO exec.MapRedTask: 2011-02-04 19:26:14,498 Stage-1 map = 100%, reduce = 0%
> 2011-02-04 19:26:24,537 Stage-1 map = 100%, reduce = 33%
> 11/02/04 19:26:24 INFO exec.MapRedTask: 2011-02-04 19:26:24,537 Stage-1 map = 100%, reduce = 33%
> 2011-02-04 19:26:27,549 Stage-1 map = 100%, reduce = 0%
> 11/02/04 19:26:27 INFO exec.MapRedTask: 2011-02-04 19:26:27,549 Stage-1 map = 100%, reduce = 0%
> 2011-02-04 19:26:30,563 Stage-1 map = 100%, reduce = 100%
> 11/02/04 19:26:30 INFO exec.MapRedTask: 2011-02-04 19:26:30,563 Stage-1 map = 100%, reduce = 100%
> Ended Job = job_201102040059_0016 with errors
> 11/02/04 19:26:30 ERROR exec.MapRedTask: Ended Job = job_201102040059_0016 with errors
> FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
> 11/02/04 19:26:30 ERROR ql.Driver: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
>
>
> I am getting errors like this in the task log:
>
> 2011-02-04 19:25:44,460 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":""},"value":{"_col0":""},"alias":0}
>         at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:268)
>         at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":""},"value":{"_col0":""},"alias":0}
>         at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:256)
>         ... 3 more
> Caused by: java.lang.RuntimeException: java.lang.NullPointerException
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
>         at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.initialize(GenericUDFBridge.java:126)
>         at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:80)
>         at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:77)
>         at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:77)
>         at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:80)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
>         at org.apache.hadoop.hive.ql.exec.LimitOperator.processOp(LimitOperator.java:47)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
>         at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:247)
>         ... 3 more
> Caused by: java.lang.NullPointerException
>         at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:107)
>         ... 16 more
> 2011-02-04 19:25:44,463 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
>
> Any ideas?
>
> Thanks in advance!
>
> - Brian