I just tried it on EMR and it worked as expected. Is there a verbose mode which can help debug what's going on with our EC2-based configuration?

Ranjan
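
For what it's worth, the closest thing to a verbose mode in the Hive CLI of this era is raising the logger level when the shell is launched, which sends the job-submission logging (including how input paths and splits are resolved) to the console. A minimal sketch, assuming a stock install where the hive launcher script is on the PATH:

    # start the Hive shell with debug logging written to the console
    hive -hiveconf hive.root.logger=DEBUG,console

The output is noisy, but the filesystem and split-computation calls made during job submission do show up, which can help pin down where an s3n:// location is being mishandled.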

On Dec 16, 2011, at 10:55 AM, Mark Grover wrote:

> Hi Ranjan,
> I agree with Igor. I consider it a good practice to point the location to the
> directory containing the file instead of the file itself.
>
> It's probably some config option that's causing you this problem (You can try
> using s3 instead of s3n in your paths). It's definitely possible to find it
> but moving to EMR might be a more time conscious choice. I personally use EMR
> (not EC2) to do my analysis and it was pretty smooth to get it to work. I
> would encourage you to give that a shot.
>
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation
>
> www: oanda.com
> www: fxtrade.com
> e: mgro...@oanda.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>
>
> ----- Original Message -----
> From: "Ranjan Bagchi" <ran...@powerreviews.com>
> To: user@hive.apache.org
> Sent: Friday, December 16, 2011 1:43:17 PM
> Subject: Re: Help with a table located on s3n
>
> Following up with more information:
>
> * The hadoop cluster is on EC2, not EMR, but I'll try bringing it up on EMR.
> * I don't see a job conf on the tracker page -- I'm semi-suspicious it never
>   makes it that far.
> * Here's the extended explain plan: it doesn't look glaringly wrong.
>
> Totally appreciate any help,
>
> Ranjan
>
>
> hive> explain extended select count(*) from ranjan_test;
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME ranjan_test))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONSTAR count)))))
>
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
>
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         ranjan_test
>           TableScan
>             alias: ranjan_test
>             GatherStats: false
>             Select Operator
>               Group By Operator
>                 aggregations:
>                   expr: count()
>                 bucketGroup: false
>                 mode: hash
>                 outputColumnNames: _col0
>                 Reduce Output Operator
>                   sort order:
>                   tag: -1
>                   value expressions:
>                     expr: _col0
>                     type: bigint
>       Needs Tagging: false
>       Path -> Alias:
>         s3n://my.bucket/hive/ranjan_test [ranjan_test]
>       Path -> Partition:
>         s3n://my.bucket/hive/ranjan_test
>           Partition
>             base file name: ranjan_test
>             input format: org.apache.hadoop.mapred.TextInputFormat
>             output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>             properties:
>               EXTERNAL TRUE
>               bucket_count -1
>               columns ip_address,num_counted
>               columns.types string:int
>               file.inputformat org.apache.hadoop.mapred.TextInputFormat
>               file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               location s3n://my.bucket/hive/ranjan_test
>               name default.ranjan_test
>               serialization.ddl struct ranjan_test { string ip_address, i32 num_counted}
>               serialization.format 1
>               serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               transient_lastDdlTime 1323982126
>             serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>
>             input format: org.apache.hadoop.mapred.TextInputFormat
>             output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>             properties:
>               EXTERNAL TRUE
>               bucket_count -1
>               columns ip_address,num_counted
>               columns.types string:int
>               file.inputformat org.apache.hadoop.mapred.TextInputFormat
>               file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               location s3n://my.bucket/hive/ranjan_test
>               name default.ranjan_test
>               serialization.ddl struct ranjan_test { string ip_address, i32 num_counted}
>               serialization.format 1
>               serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               transient_lastDdlTime 1323982126
>             serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>             name: default.ranjan_test
>           name: default.ranjan_test
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>             expr: count(VALUE._col0)
>           bucketGroup: false
>           mode: mergepartial
>           outputColumnNames: _col0
>           Select Operator
>             expressions:
>               expr: _col0
>               type: bigint
>             outputColumnNames: _col0
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               directory: hdfs://ip-10-122-91-181.ec2.internal:8020/tmp/hive-ranjan/hive_2011-12-16_13-36-36_068_574825253968459560/-ext-10001
>               NumFilesPerFileSink: 1
>               Stats Publishing Key Prefix: hdfs://ip-10-122-91-181.ec2.internal:8020/tmp/hive-ranjan/hive_2011-12-16_13-36-36_068_574825253968459560/-ext-10001/
>               table:
>                 input format: org.apache.hadoop.mapred.TextInputFormat
>                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                 properties:
>                   columns _col0
>                   columns.types bigint
>                   serialization.format 1
>               TotalFiles: 1
>               GatherStats: false
>               MultiFileSpray: false
>
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>
> Time taken: 0.156 seconds
>
>
> On Dec 15, 2011, at 5:30 PM, Ranjan Bagchi wrote:
>
> Hi,
>
> I'm experiencing the following:
>
> I've a file on s3 -- s3n://my.bucket/hive/ranjan_test . It's got fields
> (separated by \001) and records (separated by \n).
>
> I want it to be accessible on hive, the ddl is:
>
> CREATE EXTERNAL TABLE IF NOT EXISTS ranjan_test (
>   ip_address string,
>   num_counted int
> )
> STORED AS TEXTFILE
> LOCATION 's3n://my.bucket/hive/ranjan_test'
>
> I'm able to do a simple query:
>
> hive> select * from ranjan_test limit 5;
> OK
> 98.226.198.23 1676
> 74.76.148.21 1560
> 76.64.28.25 1529
> 170.37.227.10 1363
> 71.202.128.196 1232
> Time taken: 4.172 seconds
>
> What I can't do is any select which fires off a mapreduce:
>
> hive> select count(*) from ranjan_test;
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
> java.io.FileNotFoundException: File does not exist: /hive/ranjan_test/part-00000
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:546)
>     at org.apache.hadoop.mapred.lib.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:462)
>     at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:256)
>     at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:212)
>     at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:347)
>     at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:313)
>     at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:377)
>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:971)
>     at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:963)
>     at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
>     at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:671)
>     at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
>     at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
>     at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
>     at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
>     at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
>     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
>     at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:513)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Job Submission failed with exception 'java.io.FileNotFoundException(File does not exist: /hive/ranjan_test/part-00000)'
> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
>
> Any help? The AWS credentials seem good, 'cause otherwise I wouldn't get the
> initial stuff. Should I be doing something with the other machines in the
> cluster?
>
> Thanks in advance,
>
> Ranjan
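
The stack trace above is at least suggestive on its own: the FileNotFoundException is raised by DistributedFileSystem (HDFS), the path being probed has lost its s3n:// scheme (/hive/ranjan_test/part-00000), and it all happens inside CombineHiveInputFormat's split computation. A minimal sketch of one thing commonly tried in that situation, assuming a Hive 0.7/0.8-era install like the one in this thread (an untested suggestion, not the resolution confirmed here), is to fall back to the non-combining input format for the session so split computation goes through the table's own s3n:// paths:

    -- hypothetical session; the property and class names are standard Hive,
    -- but whether this resolves the EC2 setup in this thread is an assumption
    SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    SELECT count(*) FROM ranjan_test;

On the question about the other machines in the cluster: with Hadoop's s3n filesystem, the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties need to be visible to whichever JVMs actually open the files, so on a hand-rolled EC2 cluster they are usually placed in core-site.xml on every node (or embedded in the table's URI) rather than configured only on the client running the Hive CLI.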