I've made the bucket, which is derived from the Enron emails, available at s3://rjurney_public_web/from_to_date, and a sample is available at http://s3.amazonaws.com/rjurney_public_web/from_to_date/part-m-00004

I am using Hive 0.9.0. I don't care about partitioning - I just want to load my data any which way at this point. Create table isn't working, so I'm trying alter table now. I really want to create a table, then load the data into it, but external would be fine.

On Tue, May 29, 2012 at 2:42 PM, Aniket Mokashi <[email protected]> wrote:

> I think the right URI scheme is s3n://abc/def. We use that with the EMR
> version of hive in production.
>
> create table test (schema string) location 's3n://abc/def'; should work.
>
> On Tue, May 29, 2012 at 2:35 PM, Balaji Rao <[email protected]> wrote:
>
>> To partition on s3, one would create folders like:
>> s3://mybucket/path/dt=2012-05-20
>>                    dt=2012-05-21
>>                    dt=2012-05-22
>>
>> You can then use:
>> create external table from_to(from_address string, to_address string)
>> partitioned by (dt string) row format delimited fields terminated by
>> '\t' stored as textfile location 's3://mybucket/path';
>>
>> Then issue the command:
>> alter table from_to recover partitions;
>>
>> You will be able to then use the partitions:
>> select from_address, to_address, dt from from_to where dt >='2012-05-21'
>>
>> On Tue, May 29, 2012 at 5:19 PM, Russell Jurney
>> <[email protected]> wrote:
>> > I get an error when I create an external table. btw - I can partition
>> > on dt or from/to address. I'm just not clear on how to partition - my
>> > efforts fail.
>> >
>> > hive> create external table from_to(from_address string, to_address string, dt string)
>> >     > row format delimited fields terminated by '\t' stored as textfile
>> > location 's3n://rjurney_public_web/from_to_date';
>> > FAILED: Error in metadata: java.lang.IllegalArgumentException: Invalid hostname in URI s3n://rjurney_public_web/from_to_date
>> > FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
>> >
>> >
>> > However, I just upgraded to HIVE 0.9, and it works :) No reason to use
>> > the old stuff when I can scp the new one up.
>> >
>> > Thanks!
>> >
>> > On Tue, May 29, 2012 at 1:34 PM, Balaji Rao <[email protected]> wrote:
>> >>
>> >> If you are using hive on EMR, you can create a table directly from the
>> >> data on S3:
>> >>
>> >> From hive, you can create tables that use S3 data like this:
>> >>
>> >> create external table from_to(from_address string, to_address string,
>> >> dt string) row format delimited fields terminated by '\t' stored as
>> >> textfile location 's3://rjurney_public_web/from_to_date';
>> >>
>> >> You could then:
>> >> select * from from_to
>> >>
>> >> Balaji
>> >>
>> >> On Tue, May 29, 2012 at 4:20 PM, Russell Jurney
>> >> <[email protected]> wrote:
>> >> > How do I load data from S3 into Hive using Amazon EMR? I've booted a
>> >> > small cluster, and I want to load a 3-column TSV file from Pig into a
>> >> > table like this:
>> >> >
>> >> > create table from_to (from_address string, to_address string, dt string);
>> >> >
>> >> >
>> >> > When I run something like this:
>> >> >
>> >> > load data inpath 's3n://rjurney_public_web/from_to_date' into table from_to;
>> >> >
>> >> >
>> >> > I get errors:
>> >> >
>> >> > FAILED: Error in semantic analysis: Line 1:17 Invalid path
>> >> > 's3n://rjurney_public_web/from_to_date': only "file" or "hdfs" file
>> >> > systems accepted. s3n file system is not supported.
>> >> >
>> >> >
>> >> > There is no distcp on the master node of my EMR cluster, so I can't
>> >> > copy it over. I've read the documentation... and so far after a day of
>> >> > trying, I can't load data into HIVE via EMR.
>> >> >
>> >> > What am I missing? Thanks!
>> >> > --
>> >> > Russell Jurney twitter.com/rjurney [email protected] datasyndrome.com
>> >
>> >
>> >
>> >
>> > --
>> > Russell Jurney twitter.com/rjurney [email protected] datasyndrome.com
>> >
>
>
>
> --
> "...:::Aniket:::... Quetzalco@tl"
>

--
Russell Jurney twitter.com/rjurney [email protected] datasyndrome.com
