I just tried it on EMR and it worked as expected. Is there a verbose mode which can help debug what's going on with our EC2-based configuration?

Ranjan
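
For what it's worth, the closest thing to a verbose mode in the Hive CLI of this era is raising the logger level when the shell is launched, which sends the job-submission logging (including how input paths and splits are resolved) to the console. A minimal sketch, assuming a stock install where the hive launcher script is on the PATH:

    # start the Hive shell with debug logging written to the console
    hive -hiveconf hive.root.logger=DEBUG,console

The output is noisy, but the filesystem and split-computation calls made during job submission do show up, which can help pin down where an s3n:// location is being mishandled.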

On Dec 16, 2011, at 10:55 AM, Mark Grover wrote:

> Hi Ranjan,
> I agree with Igor. I consider it a good practice to point the location to the
> directory containing the file instead of the file itself.
>
> It's probably some config option that's causing you this problem (You can try
> using s3 instead of s3n in your paths). It's definitely possible to find it
> but moving to EMR might be a more time conscious choice. I personally use EMR
> (not EC2) to do my analysis and it was pretty smooth to get it to work. I
> would encourage you to give that a shot.
>
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation
>
> www: oanda.com
> www: fxtrade.com
> e: mgro...@oanda.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>
>
> ----- Original Message -----
> From: "Ranjan Bagchi" <ran...@powerreviews.com>
> To: user@hive.apache.org
> Sent: Friday, December 16, 2011 1:43:17 PM
> Subject: Re: Help with a table located on s3n
>
> Following up with more information:
>
> * The hadoop cluster is on EC2, not EMR, but I'll try bringing it up on EMR.
> * I don't see a job conf on the tracker page -- I'm semi-suspicious it never
>   makes it that far.
> * Here's the extended explain plan: it doesn't look glaringly wrong.
>
> Totally appreciate any help,
>
> Ranjan
>
>
> hive> explain extended select count(*) from ranjan_test;
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME ranjan_test))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONSTAR count)))))
>
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
>
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         ranjan_test
>           TableScan
>             alias: ranjan_test
>             GatherStats: false
>             Select Operator
>               Group By Operator
>                 aggregations:
>                   expr: count()
>                 bucketGroup: false
>                 mode: hash
>                 outputColumnNames: _col0
>                 Reduce Output Operator
>                   sort order:
>                   tag: -1
>                   value expressions:
>                     expr: _col0
>                     type: bigint
>       Needs Tagging: false
>       Path -> Alias:
>         s3n://my.bucket/hive/ranjan_test [ranjan_test]
>       Path -> Partition:
>         s3n://my.bucket/hive/ranjan_test
>           Partition
>             base file name: ranjan_test
>             input format: org.apache.hadoop.mapred.TextInputFormat
>             output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>             properties:
>               EXTERNAL TRUE
>               bucket_count -1
>               columns ip_address,num_counted
>               columns.types string:int
>               file.inputformat org.apache.hadoop.mapred.TextInputFormat
>               file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               location s3n://my.bucket/hive/ranjan_test
>               name default.ranjan_test
>               serialization.ddl struct ranjan_test { string ip_address, i32 num_counted}
>               serialization.format 1
>               serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               transient_lastDdlTime 1323982126
>             serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>
>             input format: org.apache.hadoop.mapred.TextInputFormat
>             output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>             properties:
>               EXTERNAL TRUE
>               bucket_count -1
>               columns ip_address,num_counted
>               columns.types string:int
>               file.inputformat org.apache.hadoop.mapred.TextInputFormat
>               file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               location s3n://my.bucket/hive/ranjan_test
>               name default.ranjan_test
>               serialization.ddl struct ranjan_test { string ip_address, i32 num_counted}
>               serialization.format 1
>               serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               transient_lastDdlTime 1323982126
>             serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>             name: default.ranjan_test
>           name: default.ranjan_test
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>             expr: count(VALUE._col0)
>           bucketGroup: false
>           mode: mergepartial
>           outputColumnNames: _col0
>           Select Operator
>             expressions:
>               expr: _col0
>               type: bigint
>             outputColumnNames: _col0
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               directory: hdfs://ip-10-122-91-181.ec2.internal:8020/tmp/hive-ranjan/hive_2011-12-16_13-36-36_068_574825253968459560/-ext-10001
>               NumFilesPerFileSink: 1
>               Stats Publishing Key Prefix: hdfs://ip-10-122-91-181.ec2.internal:8020/tmp/hive-ranjan/hive_2011-12-16_13-36-36_068_574825253968459560/-ext-10001/
>               table:
>                 input format: org.apache.hadoop.mapred.TextInputFormat
>                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                 properties:
>                   columns _col0
>                   columns.types bigint
>                   serialization.format 1
>               TotalFiles: 1
>               GatherStats: false
>               MultiFileSpray: false
>
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>
> Time taken: 0.156 seconds
>
>
> On Dec 15, 2011, at 5:30 PM, Ranjan Bagchi wrote:
>
> Hi,
>
> I'm experiencing the following:
>
> I've a file on s3 -- s3n://my.bucket/hive/ranjan_test . It's got fields
> (separated by \001) and records (separated by \n).
>
> I want it to be accessible on hive, the ddl is:
>
> CREATE EXTERNAL TABLE IF NOT EXISTS ranjan_test (
>   ip_address string,
>   num_counted int
> )
> STORED AS TEXTFILE
> LOCATION 's3n://my.bucket/hive/ranjan_test'
>
> I'm able to do a simple query:
>
> hive> select * from ranjan_test limit 5;
> OK
> 98.226.198.23 1676
> 74.76.148.21 1560
> 76.64.28.25 1529
> 170.37.227.10 1363
> 71.202.128.196 1232
> Time taken: 4.172 seconds
>
> What I can't do is any select which fires off a mapreduce:
>
> hive> select count(*) from ranjan_test;
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
> java.io.FileNotFoundException: File does not exist: /hive/ranjan_test/part-00000
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:546)
>     at org.apache.hadoop.mapred.lib.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:462)
>     at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:256)
>     at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:212)
>     at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:347)
>     at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:313)
>     at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:377)
>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:971)
>     at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:963)
>     at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
>     at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:671)
>     at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
>     at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
>     at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
>     at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
>     at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
>     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
>     at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:513)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Job Submission failed with exception 'java.io.FileNotFoundException(File does not exist: /hive/ranjan_test/part-00000)'
> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
>
> Any help? The AWS credentials seem good, 'cause otherwise I wouldn't get the
> initial stuff. Should I be doing something with the other machines in the
> cluster?
>
> Thanks in advance,
>
> Ranjan
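
The stack trace above is at least suggestive on its own: the FileNotFoundException is raised by DistributedFileSystem (HDFS), the path being probed has lost its s3n:// scheme (/hive/ranjan_test/part-00000), and it all happens inside CombineHiveInputFormat's split computation. A minimal sketch of one thing commonly tried in that situation, assuming a Hive 0.7/0.8-era install like the one in this thread (an untested suggestion, not the resolution confirmed here), is to fall back to the non-combining input format for the session so split computation goes through the table's own s3n:// paths:

    -- hypothetical session; the property and class names are standard Hive,
    -- but whether this resolves the EC2 setup in this thread is an assumption
    SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    SELECT count(*) FROM ranjan_test;

On the question about the other machines in the cluster: with Hadoop's s3n filesystem, the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties need to be visible to whichever JVMs actually open the files, so on a hand-rolled EC2 cluster they are usually placed in core-site.xml on every node (or embedded in the table's URI) rather than configured only on the client running the Hive CLI.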