On Thu, Aug 30, 2012 at 1:25 PM, <suman.adda...@sanofipasteur.com> wrote:
> Thank you Joe. It works now. I will try to read up on the differences
> between CombineHiveInputFormat and HiveInputFormat.

I suspect this is a bug, but I'm on such an old version of Hive that I
haven't bothered to look into it any further, since we have this workaround.

> From: Joe Crobak [mailto:joec...@gmail.com]
> Sent: Tuesday, August 28, 2012 10:22 PM
> To: user@hive.apache.org
> Subject: Re: Hive on Amazon EC2 with S3
>
> Hi Suman,
>
> We've seen this happen due to a bug in Hive's CombineHiveInputFormat. Try
> disabling that before querying by issuing:
>
> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
>
> HTH,
> Joe
>
> On Fri, Aug 24, 2012 at 4:43 PM, <suman.adda...@sanofipasteur.com> wrote:
>
> Hi,
>
> I have set up a Hadoop cluster on Amazon EC2 with my data stored on S3. I
> would like to use Hive to process the data on S3.
>
> I created an external table in Hive using the following:
>
> CREATE EXTERNAL TABLE mytable1
> (
>   HIT_TIME_GMT string,
>   SERVICE string
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LOCATION 's3n://com.xxxxx.webanalytics/hive/';
>
> I loaded a few records into the table (LOAD DATA LOCAL INPATH
> '/home/ubuntu/data/play/test' INTO TABLE mytable1;).
>
> SELECT * FROM mytable1; shows me the data in the table.
>
> When I run a query that requires a map-reduce job, for example
> SELECT COUNT(*) FROM mytable1;, an exception is thrown:
>
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
> java.io.FileNotFoundException: File does not exist: /hive/test
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:527)
>         at org.apache.hadoop.mapred.lib.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:462)
>         at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:256)
>         at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:212)
>         at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:347)
>         at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:313)
>         at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:377)
>         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1026)
>         at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1018)
>         at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:929)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:882)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:882)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:856)
>         at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:671)
>         at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
>         at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:131)
>         at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
>         at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
>         at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
>         at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
>         at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
>         at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:516)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:601)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
> Job Submission failed with exception 'java.io.FileNotFoundException(File
> does not exist: /hive/test)'
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.MapRedTask
>
> The file does exist and I can see it on S3, and SELECT * FROM mytable1; is
> returning the data in the table. I am not sure what goes wrong when the
> Hive query initiates a map-reduce job. Any pointer as to where I went
> wrong? I appreciate your help.
>
> Thank you,
> Suman
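Putting the thread together: the trace shows the table path being looked up
on HDFS (DistributedFileSystem.getFileStatus on /hive/test) rather than under
the s3n:// location, which is consistent with Joe's diagnosis that
CombineHiveInputFormat's split computation mishandles the S3 path. Below is a
minimal sketch of the complete workaround session, reassembled from the
statements quoted above (the bucket name is the placeholder from Suman's
message, not a real location):

    -- Compute splits with HiveInputFormat instead of CombineHiveInputFormat,
    -- which in this thread drops the s3n:// scheme and looks for /hive/test
    -- on HDFS. SET applies only to the current session.
    SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

    -- External table over S3, as in the original message.
    CREATE EXTERNAL TABLE mytable1 (
      HIT_TIME_GMT string,
      SERVICE string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3n://com.xxxxx.webanalytics/hive/';

    -- With the setting above, a query that needs a map-reduce job submits.
    SELECT COUNT(*) FROM mytable1;

To avoid issuing the SET in every session, the same hive.input.format
property can be set in hive-site.xml instead.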