Hi Ranjan,

I agree with Igor. I consider it good practice to point the table's LOCATION at the directory containing the file rather than at the file itself.
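As a sketch of that advice using the table from this thread (the part-00000 name below is just the file from the error message, shown for illustration):

CREATE EXTERNAL TABLE IF NOT EXISTS ranjan_test (
  ip_address string,
  num_counted int
)
STORED AS TEXTFILE
-- good: the directory that holds the data file(s)
LOCATION 's3n://my.bucket/hive/ranjan_test';

-- avoid: pointing at a single file, e.g.
-- LOCATION 's3n://my.bucket/hive/ranjan_test/part-00000';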
It's probably some config option that's causing you this problem (you can try using s3 instead of s3n in your paths). It's definitely possible to track it down, but moving to EMR might be a more time-conscious choice. I personally use EMR (not EC2) for my analysis, and it was pretty smooth to get it working. I would encourage you to give that a shot.

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com
www: fxtrade.com
e: mgro...@oanda.com

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.
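For what it's worth, a rough sketch of the s3-vs-s3n suggestion above, using the table and bucket from this thread. The credential property names are the standard Hadoop ones for the two S3 filesystems and the values are placeholders; none of this is confirmed in the thread:

-- Point the existing external table at the same path under the s3 scheme
-- (or drop and re-create the external table with the new LOCATION if your
-- Hive version doesn't support SET LOCATION):
ALTER TABLE ranjan_test SET LOCATION 's3://my.bucket/hive/ranjan_test';

-- Make sure the credentials for whichever scheme you use are visible to the
-- whole cluster (core-site.xml), or at least to the session:
set fs.s3.awsAccessKeyId=<access-key>;
set fs.s3.awsSecretAccessKey=<secret-key>;
set fs.s3n.awsAccessKeyId=<access-key>;
set fs.s3n.awsSecretAccessKey=<secret-key>;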
----- Original Message -----
From: "Ranjan Bagchi" <ran...@powerreviews.com>
To: user@hive.apache.org
Sent: Friday, December 16, 2011 1:43:17 PM
Subject: Re: Help with a table located on s3n

Following up with more information:

* The hadoop cluster is on EC2, not EMR, but I'll try bringing it up on EMR.
* I don't see a job conf on the tracker page -- I'm semi-suspicious it never makes it that far.
* Here's the extended explain plan: it doesn't look glaringly wrong.

Totally appreciate any help,
Ranjan

hive> explain extended select count(*) from ranjan_test;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME ranjan_test))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONSTAR count)))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        ranjan_test
          TableScan
            alias: ranjan_test
            GatherStats: false
            Select Operator
              Group By Operator
                aggregations:
                      expr: count()
                bucketGroup: false
                mode: hash
                outputColumnNames: _col0
                Reduce Output Operator
                  sort order:
                  tag: -1
                  value expressions:
                        expr: _col0
                        type: bigint
      Needs Tagging: false
      Path -> Alias:
        s3n://my.bucket/hive/ranjan_test [ranjan_test]
      Path -> Partition:
        s3n://my.bucket/hive/ranjan_test
          Partition
            base file name: ranjan_test
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              EXTERNAL TRUE
              bucket_count -1
              columns ip_address,num_counted
              columns.types string:int
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location s3n://my.bucket/hive/ranjan_test
              name default.ranjan_test
              serialization.ddl struct ranjan_test { string ip_address, i32 num_counted}
              serialization.format 1
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              transient_lastDdlTime 1323982126
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                EXTERNAL TRUE
                bucket_count -1
                columns ip_address,num_counted
                columns.types string:int
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location s3n://my.bucket/hive/ranjan_test
                name default.ranjan_test
                serialization.ddl struct ranjan_test { string ip_address, i32 num_counted}
                serialization.format 1
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                transient_lastDdlTime 1323982126
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.ranjan_test
            name: default.ranjan_test
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          mode: mergepartial
          outputColumnNames: _col0
          Select Operator
            expressions:
                  expr: _col0
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              directory: hdfs://ip-10-122-91-181.ec2.internal:8020/tmp/hive-ranjan/hive_2011-12-16_13-36-36_068_574825253968459560/-ext-10001
              NumFilesPerFileSink: 1
              Stats Publishing Key Prefix: hdfs://ip-10-122-91-181.ec2.internal:8020/tmp/hive-ranjan/hive_2011-12-16_13-36-36_068_574825253968459560/-ext-10001/
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    columns _col0
                    columns.types bigint
                    serialization.format 1
              TotalFiles: 1
              GatherStats: false
              MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: -1

Time taken: 0.156 seconds

On Dec 15, 2011, at 5:30 PM, Ranjan Bagchi wrote:

Hi,

I'm experiencing the following: I've a file on S3 -- s3n://my.bucket/hive/ranjan_test. It's got fields (separated by \001) and records (separated by \n). I want it to be accessible in Hive; the DDL is:

CREATE EXTERNAL TABLE IF NOT EXISTS ranjan_test (
  ip_address string,
  num_counted int
)
STORED AS TEXTFILE
LOCATION 's3n://my.bucket/hive/ranjan_test'

I'm able to do a simple query:

hive> select * from ranjan_test limit 5;
OK
98.226.198.23    1676
74.76.148.21     1560
76.64.28.25      1529
170.37.227.10    1363
71.202.128.196   1232
Time taken: 4.172 seconds

What I can't do is any select which fires off a mapreduce:

hive> select count(*) from ranjan_test;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
java.io.FileNotFoundException: File does not exist: /hive/ranjan_test/part-00000
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:546)
        at org.apache.hadoop.mapred.lib.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:462)
        at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:256)
        at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:212)
        at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:347)
        at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:313)
        at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:377)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:971)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:963)
        at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
        at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:671)
        at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:513)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Job Submission failed with exception 'java.io.FileNotFoundException(File does not exist: /hive/ranjan_test/part-00000)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask

Any help? The AWS credentials seem good, because otherwise I wouldn't get the initial select results. Should I be doing something with the other machines in the cluster?

Thanks in advance,
Ranjan
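One configuration guess that fits the stack trace, offered only as a sketch and not confirmed anywhere in this thread: the failure happens inside CombineHiveInputFormat / CombineFileInputFormat, and the 0.20-era CombineFileInputFormat has known trouble with paths that are not on the default filesystem, which would explain why an s3n-backed table ends up being looked up as /hive/ranjan_test/part-00000 on HDFS. Falling back to the plain Hive input format for the session is one way to test that theory:

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive> select count(*) from ranjan_test;

If the count runs with that setting, the problem is in split combining over s3n rather than in the table definition or the AWS credentials.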