Re: loading datafiles in s3

Kennon Lee Thu, 30 Jun 2011 15:49:31 -0700

Sweet, i think we can work with that. Beats renaming a bunch of directories
to adhere to a weird hive-specific naming convention, at least.


Thanks for the help!

On Thu, Jun 30, 2011 at 3:28 PM, Christopher, Pat <
patrick.christop...@hp.com> wrote:

> no.  you can specify the location of a partition as you go.  Hive will use
> the value provided in the partition value clause as the value of the
> partition column but it doesn’t have to be in the file name.  Ex:****
>
> ** **
>
>   ALTER TABLE foo ADD PARITION (dt=’2011-02-01’) SET LOCATION
> ‘s3n://bucket/key/subkey/’;****
>
> ** **
>
> then under subkey you can have as many keys/files as you want and all of
> those files will be added as part of that partition.  Now on the table side
> of things, lets say foo is:****
>
> ** **
>
>   CREATE TABLE foo (****
>
>   allofthedata string****
>
>   ) ****
>
>   partitioned by (dt string)****
>
> ** **
>
> This creates a table with this structure:****
>
> ** **
>
>   allofthedata   string****
>
>   dt                     string****
>
> ** **
>
> Yes!  dt is a column.  Its kind of a pain in the ass.  But, hive will then
> set the ‘dt’ column to the value supplied in the alter table statement for
> all rows that are imported as part of that partition load.****
>
> ** **
>
> tl;dr?****
>
> ** **
>
> If you specifiy the full location of the directory you wish to add, your
> data will be added correctly.****
>
> ** **
>
> pat****
>
> ** **
>
> *From:* Kennon Lee [mailto:ken...@brooklynpacket.com]
> *Sent:* Wednesday, June 29, 2011 1:27 PM
>
> *To:* user@hive.apache.org
> *Subject:* Re: loading datafiles in s3****
>
> ** **
>
> Thanks for the responses. Regarding the first question, I wasnt sure what
> you meant by using ALTER TABLE statements to allow for non-prefixed
> directory names. Don't you still have to name the directories with the
> 'blah=' part? For instance, if we do:****
>
> ** **
>
> ALTER TABLE foo ADD PARTITION (dt='2011-06-29');****
>
> ** **
>
> Doesnt this look for a directory called "dt=2011-06-29"?****
>
> On Tue, Jun 28, 2011 at 12:18 PM, Igor Tatarinov <i...@decide.com> wrote:*
> ***
>
> I think the answer to 1 is No but you can confirm on the AWS EMR forum.***
> *
>
> ** **
>
> The problem I've been having is that if you have x=foo in the prefix of
> your S3 path, EMR will try to use it as part of your partitioning key even
> if you don't want it.****
>
> Say, x=foo/y=bar/data and you want to partition on y only, EMR Hive can get
> confused. Sometimes it works, other times it complains that x is not part of
> your INSERT .. PARTITION(y) clause. I haven't quite figured out when and
> why.****
>
> ** **
>
> On Tue, Jun 28, 2011 at 11:42 AM, Christopher, Pat <
> patrick.christop...@hp.com> wrote:****
>
> allo,****
>
> 1 dunno.  I generate my EMR scripts in a separate script so generating a
> stack of ‘alter table…’ queries is easy for me****
>
> 2 event_b will have a null value in column 4.****
>
> 2 b ( you didn’t ask) what happens with this row:****
>
>  ****
>
>   event_c user_id  france 500 afifthcolumn****
>
>  ****
>
> afifthcolumn will be truncated and you’ll have only event_c through 500 in
> the row****
>
>  ****
>
> Pat****
>
>  ****
>
> *From:* Kennon Lee [mailto:ken...@tinyco.com]
> *Sent:* Monday, June 27, 2011 5:50 PM
> *To:* user@hive.apache.org
> *Subject:* loading datafiles in s3****
>
>  ****
>
> Hello,****
>
> We're using hive on amazon elastic mapreduce to process logs on s3, and I
> had a couple basic questions. Apologies if they've been answered already-- I
> gathered most info from the hive tutorial on amazon (
> http://aws.amazon.com/articles/2855), as well as from skimming the hive
> wiki pages, but I'm still very new to all of this. So, questions:****
>
>  ****
>
> 1) Is it possible to partition on directories that do not have the "key="
> prefix? Our logs are organized like s3://bucketname/dir/YYYY/MM/DD/HH/*.bz2
> and so ideally we could partition on that structure instead of adding "dt="
> to every directory name. I found an old thread discussing this (
> http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded<http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded)>)
> but couldnt find the actual syntax.****
>
>  ****
>
> 2) How does hive handle tab-delimited files where rows sometimes have
> different column counts? For instance, if we are parsing an event log that
> contains multiple events, some of which have more columns associated with
> them:****
>
>  ****
>
> event_a        user_id        apple          300****
>
> event_b        user_id        cat****
>
>  ****
>
> If i define my hive table to have 4 columns, how will hive react to the
> event_b row?****
>
>  ****
>
> Thanks!****
>
>  ****
>
> ** **
>
> ** **
>

Re: loading datafiles in s3

Reply via email to