You raise an important point: "metadata" commands like CREATE TABLE and ALTER TABLE only affect metadata, not the actual data itself. So you have to write the files into the partition directories yourself, and in the correct schema. One way to do the latter is to stage the raw data in a "temporary" table and write a query that INSERTs INTO the correct partitions of the final table.
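For concreteness, the staging table might be declared something like this. It's just a sketch: the column list is abbreviated, and the delimiter and LOCATION path are assumptions you would adjust to match wherever your raw files actually live.

  -- Hypothetical staging table over the raw, unpartitioned files.
  CREATE EXTERNAL TABLE raw_stocks (
    ymd           STRING,    -- year-month-day string, e.g., YYYY-MM-DD
    symbol        STRING,
    closing_price FLOAT      -- add the rest of your columns here
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','    -- assumes CSV-like text input
  LOCATION '/data/raw/stocks';                     -- hypothetical path to the raw files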
However, dynamic partitioning with external tables requires a little care. By default, if the partitions don't already exist, the directories will be created under hive.metastore.warehouse.dir, e.g., /data/hive/warehouse, as if the table were managed, even if you intended the files to be somewhere else. BUT, if you create the partitions in advance, the data will end up in the correct directories for each partition.

Here's an example, adapted from the courseware we use at Think Big Analytics. First, assume I have some "raw" stocks data staged in a table named raw_stocks, with schema (ymd STRING, symbol STRING, closing_price FLOAT, ...), where "ymd" is a year-month-day string, e.g., YYYY-MM-DD.

Next, create an EXTERNAL stocks table partitioned by year, for example:

  CREATE EXTERNAL TABLE stocks (ymd STRING, symbol STRING, closing_price FLOAT, ...)
  PARTITIONED BY (year INT);

Now, if you don't create the partitions in advance and run the following, watch what happens:

  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;

  INSERT INTO TABLE stocks PARTITION(year)
  SELECT ymd, symbol, closing_price, year(ymd) FROM raw_stocks;

I'll end up with directories under /data/hive/warehouse/stocks/, e.g.,

  /data/hive/warehouse/stocks/year=1984
  /data/hive/warehouse/stocks/year=1985
  ...
  /data/hive/warehouse/stocks/year=2011
  /data/hive/warehouse/stocks/year=2012

You can use DESCRIBE FORMATTED stocks PARTITION(year=XXXX) to confirm these locations (as well as dfs -ls ..., of course). In other words, it behaves just like managed/internal tables. Also, just to be clear, if you now drop the table, the directories won't be deleted, even though they are in the managed table location, because the table is external.

HOWEVER, if you first run a query over raw_stocks to determine all the years, you can create the partitions in advance. (I wrote a bash script to generate these statements, and yes, you can build a single statement that does all of them at once; see the sketch at the end of this message.)

  ALTER TABLE stocks ADD PARTITION(year=1984) LOCATION '/data/stocks/1984';
  ...
  ALTER TABLE stocks ADD PARTITION(year=2012) LOCATION '/data/stocks/2012';

Now the dynamic partitioning query will run as before, but the partitions will be in the locations you declared:

  /data/stocks/1984
  /data/stocks/1985
  ...
  /data/stocks/2011
  /data/stocks/2012

as desired. It should work for S3, etc., as well.
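A quick aside on that "single statement" point: I believe Hive's ALTER TABLE ... ADD PARTITION syntax accepts several PARTITION/LOCATION pairs in one statement, so the all-at-once form might look roughly like the sketch below (same hypothetical /data/stocks paths as above; worth double-checking on your Hive version):

  -- Sketch only: one PARTITION/LOCATION pair per year, in a single statement.
  ALTER TABLE stocks ADD IF NOT EXISTS
    PARTITION (year=1984) LOCATION '/data/stocks/1984'
    PARTITION (year=1985) LOCATION '/data/stocks/1985'
    -- ... and so on for the intervening years ...
    PARTITION (year=2012) LOCATION '/data/stocks/2012';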
dean

On Mon, Dec 17, 2012 at 5:32 AM, Fernando Andrés Doglio Turissini <
fernando.dog...@globant.com> wrote:

> Hello, and thank you both for your answers...
> I think I found the problem... keep in mind I'm quite new to all this
> Hive/Hadoop stuff :)
>
> I think my problem was due to the fact that the create table statement
> had the partition defined, but the information was not partitioned on the
> file system (it was just one file inside a folder).
>
> I'm guessing that what I have to do is load the data into a
> non-partitioned table and then copy the information with Hive, dynamically
> partitioning the data in the same query... is that right?
>
> Thanks again!
>
>
> On Fri, Dec 14, 2012 at 1:22 PM, Dean Wampler <
> dean.wamp...@thinkbiganalytics.com> wrote:
>
>> A couple of clarifying questions and suggestions. First, keep in mind
>> that Hive doesn't care if you have a typo of some kind in your external
>> location ;) Use DESCRIBE FORMATTED to verify the path is right. For an
>> external partitioned table, use DESCRIBE FORMATTED table
>> PARTITION(col1=val1, col2=val2, ...).
>>
>> A dumb mistake I've often made is to use a variable in a script, e.g.,
>> "... LOCATION '${DATA}/foo/bar/baz';" and forget to define DATA when
>> invoking the script.
>>
>> When you said "load a file", did you mean using the LOAD DATA ... INPATH
>> 's3n://...' command? I've read that s3n is not supported for these
>> statements, but I'm not sure that's actually true.
>>
>> If everything looks correct, you should be able to do hadoop fs -ls
>> s3n://... successfully. Actually, since your Hive environment could have
>> different settings for some filesystem properties, it might be a better
>> check to use dfs -ls ... at the hive CLI prompt.
>>
>> Otherwise, it's probably the SerDe, as Mark suggested. If possible, I
>> would try reading the data through a temporary external table that uses a
>> built-in SerDe, like the default, just to confirm that it's the SerDe
>> rather than a file system issue.
>>
>> Hope that helps.
>> dean
>>
>> On Tue, Dec 11, 2012 at 8:05 AM, Fernando Andrés Doglio Turissini <
>> fernando.dog...@globant.com> wrote:
>>
>>> Long subject, I know... let me explain a bit more about the problem:
>>>
>>> I'm trying to load a file into a Hive table (this is on an EMR
>>> instance). For that, I create an external table and set the location to
>>> the folder on an S3 bucket where the file resides.
>>> The problem is that even though the table is created correctly, when I
>>> do a "select * from table" it returns nothing. I'm not seeing errors in
>>> the logs either, so I don't know what can be happening...
>>>
>>> Also, probably important: I'm using a custom SerDe that I did not
>>> write... but I do have the code for it.
>>>
>>> I'm quite new to Hive, so I appreciate any kind of pointers you can
>>> throw at me.
>>>
>>> Thanks!
>>> Fernando Doglio
>>>
>>
>> --
>> *Dean Wampler, Ph.D.*
>> thinkbiganalytics.com
>> +1-312-339-1330
>>
>
--
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330