Sweet, i think we can work with that. Beats renaming a bunch of directories to adhere to a weird hive-specific naming convention, at least.
Thanks for the help! On Thu, Jun 30, 2011 at 3:28 PM, Christopher, Pat < patrick.christop...@hp.com> wrote: > no. you can specify the location of a partition as you go. Hive will use > the value provided in the partition value clause as the value of the > partition column but it doesn’t have to be in the file name. Ex:**** > > ** ** > > ALTER TABLE foo ADD PARITION (dt=’2011-02-01’) SET LOCATION > ‘s3n://bucket/key/subkey/’;**** > > ** ** > > then under subkey you can have as many keys/files as you want and all of > those files will be added as part of that partition. Now on the table side > of things, lets say foo is:**** > > ** ** > > CREATE TABLE foo (**** > > allofthedata string**** > > ) **** > > partitioned by (dt string)**** > > ** ** > > This creates a table with this structure:**** > > ** ** > > allofthedata string**** > > dt string**** > > ** ** > > Yes! dt is a column. Its kind of a pain in the ass. But, hive will then > set the ‘dt’ column to the value supplied in the alter table statement for > all rows that are imported as part of that partition load.**** > > ** ** > > tl;dr?**** > > ** ** > > If you specifiy the full location of the directory you wish to add, your > data will be added correctly.**** > > ** ** > > pat**** > > ** ** > > *From:* Kennon Lee [mailto:ken...@brooklynpacket.com] > *Sent:* Wednesday, June 29, 2011 1:27 PM > > *To:* user@hive.apache.org > *Subject:* Re: loading datafiles in s3**** > > ** ** > > Thanks for the responses. Regarding the first question, I wasnt sure what > you meant by using ALTER TABLE statements to allow for non-prefixed > directory names. Don't you still have to name the directories with the > 'blah=' part? For instance, if we do:**** > > ** ** > > ALTER TABLE foo ADD PARTITION (dt='2011-06-29');**** > > ** ** > > Doesnt this look for a directory called "dt=2011-06-29"?**** > > On Tue, Jun 28, 2011 at 12:18 PM, Igor Tatarinov <i...@decide.com> wrote:* > *** > > I think the answer to 1 is No but you can confirm on the AWS EMR forum.*** > * > > ** ** > > The problem I've been having is that if you have x=foo in the prefix of > your S3 path, EMR will try to use it as part of your partitioning key even > if you don't want it.**** > > Say, x=foo/y=bar/data and you want to partition on y only, EMR Hive can get > confused. Sometimes it works, other times it complains that x is not part of > your INSERT .. PARTITION(y) clause. I haven't quite figured out when and > why.**** > > ** ** > > On Tue, Jun 28, 2011 at 11:42 AM, Christopher, Pat < > patrick.christop...@hp.com> wrote:**** > > allo,**** > > 1 dunno. I generate my EMR scripts in a separate script so generating a > stack of ‘alter table…’ queries is easy for me**** > > 2 event_b will have a null value in column 4.**** > > 2 b ( you didn’t ask) what happens with this row:**** > > **** > > event_c user_id france 500 afifthcolumn**** > > **** > > afifthcolumn will be truncated and you’ll have only event_c through 500 in > the row**** > > **** > > Pat**** > > **** > > *From:* Kennon Lee [mailto:ken...@tinyco.com] > *Sent:* Monday, June 27, 2011 5:50 PM > *To:* user@hive.apache.org > *Subject:* loading datafiles in s3**** > > **** > > Hello,**** > > We're using hive on amazon elastic mapreduce to process logs on s3, and I > had a couple basic questions. Apologies if they've been answered already-- I > gathered most info from the hive tutorial on amazon ( > http://aws.amazon.com/articles/2855), as well as from skimming the hive > wiki pages, but I'm still very new to all of this. So, questions:**** > > **** > > 1) Is it possible to partition on directories that do not have the "key=" > prefix? Our logs are organized like s3://bucketname/dir/YYYY/MM/DD/HH/*.bz2 > and so ideally we could partition on that structure instead of adding "dt=" > to every directory name. I found an old thread discussing this ( > http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded<http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded)>) > but couldnt find the actual syntax.**** > > **** > > 2) How does hive handle tab-delimited files where rows sometimes have > different column counts? For instance, if we are parsing an event log that > contains multiple events, some of which have more columns associated with > them:**** > > **** > > event_a user_id apple 300**** > > event_b user_id cat**** > > **** > > If i define my hive table to have 4 columns, how will hive react to the > event_b row?**** > > **** > > Thanks!**** > > **** > > ** ** > > ** ** >