Hi Ranjan,

I agree with Igor. I consider it good practice to point the table's LOCATION at the directory containing the file rather than at the file itself.
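As a sketch of that advice using the table from this thread (the part-00000 name below is just the file from the error message, shown for illustration):

CREATE EXTERNAL TABLE IF NOT EXISTS ranjan_test (
  ip_address string,
  num_counted int
)
STORED AS TEXTFILE
-- good: the directory that holds the data file(s)
LOCATION 's3n://my.bucket/hive/ranjan_test';

-- avoid: pointing at a single file, e.g.
-- LOCATION 's3n://my.bucket/hive/ranjan_test/part-00000';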
It's probably some config option that's causing you this problem (you can try using s3 instead of s3n in your paths). It's definitely possible to track it down, but moving to EMR might be a more time-conscious choice. I personally use EMR (not EC2) for my analysis, and it was pretty smooth to get it working. I would encourage you to give that a shot.

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com
www: fxtrade.com
e: mgro...@oanda.com

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.
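For what it's worth, a rough sketch of the s3-vs-s3n suggestion above, using the table and bucket from this thread. The credential property names are the standard Hadoop ones for the two S3 filesystems and the values are placeholders; none of this is confirmed in the thread:

-- Point the existing external table at the same path under the s3 scheme
-- (or drop and re-create the external table with the new LOCATION if your
-- Hive version doesn't support SET LOCATION):
ALTER TABLE ranjan_test SET LOCATION 's3://my.bucket/hive/ranjan_test';

-- Make sure the credentials for whichever scheme you use are visible to the
-- whole cluster (core-site.xml), or at least to the session:
set fs.s3.awsAccessKeyId=<access-key>;
set fs.s3.awsSecretAccessKey=<secret-key>;
set fs.s3n.awsAccessKeyId=<access-key>;
set fs.s3n.awsSecretAccessKey=<secret-key>;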
----- Original Message -----
From: "Ranjan Bagchi" <ran...@powerreviews.com>
To: user@hive.apache.org
Sent: Friday, December 16, 2011 1:43:17 PM
Subject: Re: Help with a table located on s3n

Following up with more information:

* The hadoop cluster is on EC2, not EMR, but I'll try bringing it up on EMR.
* I don't see a job conf on the tracker page -- I'm semi-suspicious it never makes it that far.
* Here's the extended explain plan: it doesn't look glaringly wrong.

Totally appreciate any help,
Ranjan

hive> explain extended select count(*) from ranjan_test;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME ranjan_test))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONSTAR count)))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        ranjan_test
          TableScan
            alias: ranjan_test
            GatherStats: false
            Select Operator
              Group By Operator
                aggregations:
                      expr: count()
                bucketGroup: false
                mode: hash
                outputColumnNames: _col0
                Reduce Output Operator
                  sort order:
                  tag: -1
                  value expressions:
                        expr: _col0
                        type: bigint
      Needs Tagging: false
      Path -> Alias:
        s3n://my.bucket/hive/ranjan_test [ranjan_test]
      Path -> Partition:
        s3n://my.bucket/hive/ranjan_test
          Partition
            base file name: ranjan_test
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              EXTERNAL TRUE
              bucket_count -1
              columns ip_address,num_counted
              columns.types string:int
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location s3n://my.bucket/hive/ranjan_test
              name default.ranjan_test
              serialization.ddl struct ranjan_test { string ip_address, i32 num_counted}
              serialization.format 1
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              transient_lastDdlTime 1323982126
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                EXTERNAL TRUE
                bucket_count -1
                columns ip_address,num_counted
                columns.types string:int
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location s3n://my.bucket/hive/ranjan_test
                name default.ranjan_test
                serialization.ddl struct ranjan_test { string ip_address, i32 num_counted}
                serialization.format 1
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                transient_lastDdlTime 1323982126
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.ranjan_test
            name: default.ranjan_test
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          mode: mergepartial
          outputColumnNames: _col0
          Select Operator
            expressions:
                  expr: _col0
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              directory: hdfs://ip-10-122-91-181.ec2.internal:8020/tmp/hive-ranjan/hive_2011-12-16_13-36-36_068_574825253968459560/-ext-10001
              NumFilesPerFileSink: 1
              Stats Publishing Key Prefix: hdfs://ip-10-122-91-181.ec2.internal:8020/tmp/hive-ranjan/hive_2011-12-16_13-36-36_068_574825253968459560/-ext-10001/
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    columns _col0
                    columns.types bigint
                    serialization.format 1
              TotalFiles: 1
              GatherStats: false
              MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: -1

Time taken: 0.156 seconds

On Dec 15, 2011, at 5:30 PM, Ranjan Bagchi wrote:

Hi,

I'm experiencing the following: I've a file on S3 -- s3n://my.bucket/hive/ranjan_test. It's got fields (separated by \001) and records (separated by \n). I want it to be accessible in Hive; the DDL is:

CREATE EXTERNAL TABLE IF NOT EXISTS ranjan_test (
  ip_address string,
  num_counted int
)
STORED AS TEXTFILE
LOCATION 's3n://my.bucket/hive/ranjan_test'

I'm able to do a simple query:

hive> select * from ranjan_test limit 5;
OK
98.226.198.23    1676
74.76.148.21     1560
76.64.28.25      1529
170.37.227.10    1363
71.202.128.196   1232
Time taken: 4.172 seconds

What I can't do is any select which fires off a mapreduce:

hive> select count(*) from ranjan_test;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
java.io.FileNotFoundException: File does not exist: /hive/ranjan_test/part-00000
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:546)
        at org.apache.hadoop.mapred.lib.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:462)
        at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:256)
        at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:212)
        at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:347)
        at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:313)
        at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:377)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:971)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:963)
        at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
        at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:671)
        at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:513)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Job Submission failed with exception 'java.io.FileNotFoundException(File does not exist: /hive/ranjan_test/part-00000)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask

Any help? The AWS credentials seem good, because otherwise I wouldn't get the initial select results. Should I be doing something with the other machines in the cluster?

Thanks in advance,
Ranjan
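One configuration guess that fits the stack trace, offered only as a sketch and not confirmed anywhere in this thread: the failure happens inside CombineHiveInputFormat / CombineFileInputFormat, and the 0.20-era CombineFileInputFormat has known trouble with paths that are not on the default filesystem, which would explain why an s3n-backed table ends up being looked up as /hive/ranjan_test/part-00000 on HDFS. Falling back to the plain Hive input format for the session is one way to test that theory:

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive> select count(*) from ranjan_test;

If the count runs with that setting, the problem is in split combining over s3n rather than in the table definition or the AWS credentials.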