Hi Dean,

I tried inserting into a bucketed Hive table from a non-bucketed table
using an INSERT OVERWRITE ... SELECT FROM statement, but I get the
following error:

----------------------------------------------------------------------------------
Exception in thread "Thread-225" java.lang.NullPointerException
        at org.apache.hadoop.hive.shims.Hadoop23Shims.getTaskAttemptLogUrl(Hadoop23Shims.java:44)
        at org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.getTaskInfos(JobDebugger.java:186)
        at org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.run(JobDebugger.java:142)
        at java.lang.Thread.run(Thread.java:662)
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
----------------------------------------------------------------------------------
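The statement is essentially of this shape (a minimal sketch; the table,
column, and bucket-count values below are made up for illustration, not
our real schema):

  -- Illustrative names only; the real schema is much wider.
  -- Target table: identical to the source except for the CLUSTERED BY clause.
  CREATE TABLE events_bucketed (
    id      BIGINT,
    name    STRING,
    details ARRAY<STRUCT<code:STRING, amount:INT>>   -- one of the complex columns
  )
  CLUSTERED BY (id) INTO 32 BUCKETS;

  INSERT OVERWRITE TABLE events_bucketed
  SELECT id, name, details
  FROM events_staging;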
Both tables have the same structure, except that one has a CLUSTERED BY
clause and the other does not. Some columns are defined as arrays of
structs. The INSERT statement works fine if I take out those complex
columns. Are there any known issues with loading STRUCT or ARRAY OF
STRUCT fields?

Thanks for your time and help.

Sadu

On Sat, Mar 30, 2013 at 7:00 PM, Dean Wampler <
dean.wamp...@thinkbiganalytics.com> wrote:

> The table can be external. You should be able to use this data with
> other tools, because all bucketing does is ensure that all records with
> a given key are written into the same block. This is why
> clustered/blocked data can be joined on those keys using map-side
> joins; Hive knows it can cache an individual block in memory, and that
> block will hold all records across the table for the keys in that
> block.
>
> So, Java MR apps and Pig can still read the records, but they won't
> necessarily understand how the data is organized. I.e., it might appear
> unsorted. Perhaps HCatalog will allow other tools to exploit the
> structure, but I'm not sure.
>
> dean
>
>
> On Sat, Mar 30, 2013 at 5:44 PM, Sadananda Hegde <saduhe...@gmail.com> wrote:
>
>> Thanks, Dean.
>>
>> Does that mean this bucketing is exclusively a Hive feature and not
>> available to other tools like Java, Pig, etc.?
>>
>> Also, my final tables have to be managed tables, not external tables,
>> right?
>>
>> Thanks again for your time and help.
>>
>> Sadu
>>
>>
>> On Fri, Mar 29, 2013 at 5:57 PM, Dean Wampler <
>> dean.wamp...@thinkbiganalytics.com> wrote:
>>
>>> I don't know of any way to avoid creating new tables and moving the
>>> data. In fact, that's the official way to do it, from a temp table to
>>> the final table, so Hive can ensure the bucketing is done correctly:
>>>
>>> https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html
>>>
>>> In other words, you might have a big move now, but going forward,
>>> you'll want to stage your data in a temp table, use this procedure to
>>> put it in the final location, then delete the temp data.
>>>
>>> dean
>>>
>>> On Fri, Mar 29, 2013 at 4:58 PM, Sadananda Hegde <saduhe...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We run M/R jobs to parse and process large, highly complex XML files
>>>> into Avro files. Then we build external Hive tables on top of the
>>>> parsed Avro files. The Hive tables are partitioned by day, but the
>>>> partitions are still huge, and joins do not perform well. So I would
>>>> like to try creating buckets on the join key. How do I create the
>>>> buckets on the existing HDFS files? If at all possible, I would
>>>> prefer to avoid creating another set of (bucketed) tables and
>>>> loading data from the non-bucketed tables into the bucketed ones. Is
>>>> it possible to do the bucketing in Java as part of the M/R jobs
>>>> while creating the Avro files?
>>>>
>>>> Any help / insight would be greatly appreciated.
>>>>
>>>> Thank you very much for your time and help.
>>>>
>>>> Sadu
>>>
>>>
>>> --
>>> *Dean Wampler, Ph.D.*
>>> thinkbiganalytics.com
>>> +1-312-339-1330
>>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
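P.S. In case it helps anyone reading this thread later, my understanding
of the procedure on the wiki page Dean linked is roughly the following
(again a minimal sketch with made-up names; see the page itself for
details):

  -- Made-up names for illustration.
  -- One-time DDL for the final, bucketed table.
  CREATE TABLE events_final (id BIGINT, payload STRING)
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (id) INTO 32 BUCKETS;

  -- Per load: stage the data unbucketed, then let Hive bucket it on
  -- insert. This setting makes Hive match the reducer count to the
  -- bucket count so each bucket file is written correctly.
  SET hive.enforce.bucketing = true;

  FROM events_staging s
  INSERT OVERWRITE TABLE events_final PARTITION (dt = '2013-03-29')
  SELECT s.id, s.payload
  WHERE s.dt = '2013-03-29';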