Hi Dean,

I tried inserting into a bucketed Hive table from a non-bucketed table
using an INSERT OVERWRITE ... SELECT FROM statement, but I get the
following error:

----------------------------------------------------------------------------------
Exception in thread "Thread-225" java.lang.NullPointerException
        at org.apache.hadoop.hive.shims.Hadoop23Shims.getTaskAttemptLogUrl(Hadoop23Shims.java:44)
        at org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.getTaskInfos(JobDebugger.java:186)
        at org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.run(JobDebugger.java:142)
        at java.lang.Thread.run(Thread.java:662)
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
----------------------------------------------------------------------------------
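The statement is essentially of this shape (a minimal sketch; the table,
column, and bucket-count values below are made up for illustration, not
our real schema):

  -- Illustrative names only; the real schema is much wider.
  -- Target table: identical to the source except for the CLUSTERED BY clause.
  CREATE TABLE events_bucketed (
    id      BIGINT,
    name    STRING,
    details ARRAY<STRUCT<code:STRING, amount:INT>>   -- one of the complex columns
  )
  CLUSTERED BY (id) INTO 32 BUCKETS;

  INSERT OVERWRITE TABLE events_bucketed
  SELECT id, name, details
  FROM events_staging;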
Both tables have the same structure, except that one has a CLUSTERED BY
clause and the other does not. Some columns are defined as arrays of
structs. The INSERT statement works fine if I take out those complex
columns. Are there any known issues with loading STRUCT or ARRAY OF
STRUCT fields?

Thanks for your time and help.

Sadu

On Sat, Mar 30, 2013 at 7:00 PM, Dean Wampler <
dean.wamp...@thinkbiganalytics.com> wrote:

> The table can be external. You should be able to use this data with
> other tools, because all bucketing does is ensure that all records with
> a given key are written into the same block. This is why
> clustered/blocked data can be joined on those keys using map-side
> joins; Hive knows it can cache an individual block in memory, and that
> block will hold all records across the table for the keys in that
> block.
>
> So, Java MR apps and Pig can still read the records, but they won't
> necessarily understand how the data is organized. I.e., it might appear
> unsorted. Perhaps HCatalog will allow other tools to exploit the
> structure, but I'm not sure.
>
> dean
>
>
> On Sat, Mar 30, 2013 at 5:44 PM, Sadananda Hegde <saduhe...@gmail.com> wrote:
>
>> Thanks, Dean.
>>
>> Does that mean this bucketing is exclusively a Hive feature and not
>> available to other tools like Java, Pig, etc.?
>>
>> Also, my final tables have to be managed tables, not external tables,
>> right?
>>
>> Thanks again for your time and help.
>>
>> Sadu
>>
>>
>> On Fri, Mar 29, 2013 at 5:57 PM, Dean Wampler <
>> dean.wamp...@thinkbiganalytics.com> wrote:
>>
>>> I don't know of any way to avoid creating new tables and moving the
>>> data. In fact, that's the official way to do it, from a temp table to
>>> the final table, so Hive can ensure the bucketing is done correctly:
>>>
>>> https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html
>>>
>>> In other words, you might have a big move now, but going forward,
>>> you'll want to stage your data in a temp table, use this procedure to
>>> put it in the final location, then delete the temp data.
>>>
>>> dean
>>>
>>> On Fri, Mar 29, 2013 at 4:58 PM, Sadananda Hegde <saduhe...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We run M/R jobs to parse and process large, highly complex XML files
>>>> into Avro files. Then we build external Hive tables on top of the
>>>> parsed Avro files. The Hive tables are partitioned by day, but the
>>>> partitions are still huge, and joins do not perform well. So I would
>>>> like to try creating buckets on the join key. How do I create the
>>>> buckets on the existing HDFS files? If at all possible, I would
>>>> prefer to avoid creating another set of (bucketed) tables and
>>>> loading data from the non-bucketed tables into the bucketed ones. Is
>>>> it possible to do the bucketing in Java as part of the M/R jobs
>>>> while creating the Avro files?
>>>>
>>>> Any help / insight would be greatly appreciated.
>>>>
>>>> Thank you very much for your time and help.
>>>>
>>>> Sadu
>>>
>>>
>>> --
>>> *Dean Wampler, Ph.D.*
>>> thinkbiganalytics.com
>>> +1-312-339-1330
>>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
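P.S. In case it helps anyone reading this thread later, my understanding
of the procedure on the wiki page Dean linked is roughly the following
(again a minimal sketch with made-up names; see the page itself for
details):

  -- Made-up names for illustration.
  -- One-time DDL for the final, bucketed table.
  CREATE TABLE events_final (id BIGINT, payload STRING)
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (id) INTO 32 BUCKETS;

  -- Per load: stage the data unbucketed, then let Hive bucket it on
  -- insert. This setting makes Hive match the reducer count to the
  -- bucket count so each bucket file is written correctly.
  SET hive.enforce.bucketing = true;

  FROM events_staging s
  INSERT OVERWRITE TABLE events_final PARTITION (dt = '2013-03-29')
  SELECT s.id, s.payload
  WHERE s.dt = '2013-03-29';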