You raise an important point: "metadata" commands like CREATE TABLE and ALTER TABLE only affect metadata, not the actual data itself. So you have to write the files into the partition directories yourself, and in the correct schema. One way to do the latter is to stage the raw data in a "temporary" table and write a query that INSERTs INTO the correct partitions of the final table.
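For concreteness, the staging table might be declared something like this. It's just a sketch: the column list is abbreviated, and the delimiter and LOCATION path are assumptions you would adjust to match wherever your raw files actually live.

  -- Hypothetical staging table over the raw, unpartitioned files.
  CREATE EXTERNAL TABLE raw_stocks (
    ymd           STRING,    -- year-month-day string, e.g., YYYY-MM-DD
    symbol        STRING,
    closing_price FLOAT      -- add the rest of your columns here
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','    -- assumes CSV-like text input
  LOCATION '/data/raw/stocks';                     -- hypothetical path to the raw files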
However, dynamic partitioning with external tables requires a little care. By default, if the partitions don't already exist, the directories will be created under hive.metastore.warehouse.dir, e.g., /data/hive/warehouse, as if the table were managed, even if you intended the files to be somewhere else. BUT, if you create the partitions in advance, the data will end up in the correct directories for each partition.

Here's an example, adapted from the courseware we use at Think Big Analytics. First, assume I have some "raw" stocks data staged in a table named raw_stocks, with schema (ymd STRING, symbol STRING, closing_price FLOAT, ...), where "ymd" is a year-month-day string, e.g., YYYY-MM-DD.

Next, create an EXTERNAL stocks table partitioned by year, for example:

  CREATE EXTERNAL TABLE stocks (ymd STRING, symbol STRING, closing_price FLOAT, ...)
  PARTITIONED BY (year INT);

Now, if you don't create the partitions in advance and run the following, watch what happens:

  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;

  INSERT INTO TABLE stocks PARTITION(year)
  SELECT ymd, symbol, closing_price, year(ymd) FROM raw_stocks;

I'll end up with directories under /data/hive/warehouse/stocks/, e.g.,

  /data/hive/warehouse/stocks/year=1984
  /data/hive/warehouse/stocks/year=1985
  ...
  /data/hive/warehouse/stocks/year=2011
  /data/hive/warehouse/stocks/year=2012

You can use DESCRIBE FORMATTED stocks PARTITION(year=XXXX) to confirm these locations (as well as dfs -ls ..., of course). In other words, it behaves just like managed/internal tables. Also, just to be clear, if you now drop the table, the directories won't be deleted, even though they are in the managed table location, because the table is external.

HOWEVER, if you first run a query over raw_stocks to determine all the years, you can create the partitions in advance. (I wrote a bash script to generate these statements, and yes, you can build a single statement that does all of them at once; see the sketch at the end of this message.)

  ALTER TABLE stocks ADD PARTITION(year=1984) LOCATION '/data/stocks/1984';
  ...
  ALTER TABLE stocks ADD PARTITION(year=2012) LOCATION '/data/stocks/2012';

Now the dynamic partitioning query will run as before, but the partitions will be in the locations you declared:

  /data/stocks/1984
  /data/stocks/1985
  ...
  /data/stocks/2011
  /data/stocks/2012

as desired. It should work for S3, etc., as well.
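A quick aside on that "single statement" point: I believe Hive's ALTER TABLE ... ADD PARTITION syntax accepts several PARTITION/LOCATION pairs in one statement, so the all-at-once form might look roughly like the sketch below (same hypothetical /data/stocks paths as above; worth double-checking on your Hive version):

  -- Sketch only: one PARTITION/LOCATION pair per year, in a single statement.
  ALTER TABLE stocks ADD IF NOT EXISTS
    PARTITION (year=1984) LOCATION '/data/stocks/1984'
    PARTITION (year=1985) LOCATION '/data/stocks/1985'
    -- ... and so on for the intervening years ...
    PARTITION (year=2012) LOCATION '/data/stocks/2012';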
dean

On Mon, Dec 17, 2012 at 5:32 AM, Fernando Andrés Doglio Turissini <
fernando.dog...@globant.com> wrote:

> Hello, and thank you both for your answers...
> I think I found the problem... keep in mind I'm quite new to all this
> Hive/Hadoop stuff :)
>
> I think my problem was due to the fact that the create table statement
> had the partition defined, but the information was not partitioned on the
> file system (it was just one file inside a folder).
>
> I'm guessing that what I have to do is load the data into a
> non-partitioned table and then copy the information with Hive, dynamically
> partitioning the data in the same query... is that right?
>
> Thanks again!
>
>
> On Fri, Dec 14, 2012 at 1:22 PM, Dean Wampler <
> dean.wamp...@thinkbiganalytics.com> wrote:
>
>> A couple of clarifying questions and suggestions. First, keep in mind
>> that Hive doesn't care if you have a typo of some kind in your external
>> location ;) Use DESCRIBE FORMATTED to verify the path is right. For an
>> external partitioned table, use DESCRIBE FORMATTED table
>> PARTITION(col1=val1, col2=val2, ...).
>>
>> A dumb mistake I've often made is to use a variable in a script, e.g.,
>> "... LOCATION '${DATA}/foo/bar/baz';" and forget to define DATA when
>> invoking the script.
>>
>> When you said "load a file", did you mean using the LOAD DATA ... INPATH
>> 's3n://...' command? I've read that s3n is not supported for these
>> statements, but I'm not sure that's actually true.
>>
>> If everything looks correct, you should be able to do hadoop fs -ls
>> s3n://... successfully. Actually, since your Hive environment could have
>> different settings for some filesystem properties, it might be a better
>> check to use dfs -ls ... at the hive CLI prompt.
>>
>> Otherwise, it's probably the SerDe, as Mark suggested. If possible, I
>> would try reading the data through a temporary external table that uses a
>> built-in SerDe, like the default, just to confirm that it's the SerDe
>> rather than a file system issue.
>>
>> Hope that helps.
>> dean
>>
>> On Tue, Dec 11, 2012 at 8:05 AM, Fernando Andrés Doglio Turissini <
>> fernando.dog...@globant.com> wrote:
>>
>>> Long subject, I know... let me explain a bit more about the problem:
>>>
>>> I'm trying to load a file into a Hive table (this is on an EMR
>>> instance). For that, I create an external table and set the location to
>>> the folder on an S3 bucket where the file resides.
>>> The problem is that even though the table is created correctly, when I
>>> do a "select * from table" it returns nothing. I'm not seeing errors in
>>> the logs either, so I don't know what can be happening...
>>>
>>> Also, probably important: I'm using a custom SerDe that I did not
>>> write... but I do have the code for it.
>>>
>>> I'm quite new to Hive, so I appreciate any kind of pointers you can
>>> throw at me.
>>>
>>> Thanks!
>>> Fernando Doglio
>>>
>>
>> --
>> *Dean Wampler, Ph.D.*
>> thinkbiganalytics.com
>> +1-312-339-1330
>>
>
--
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330