Yes, I do have the awsSecretAccessKey defined, correct, I believe. To test:
mashlogic@cruncher ~ [ 8:07AM] hadoop dfs -ls s3n://ml-weblogs/smartlinks/daytsvs/day=20111130/
Found 29 items
-rwxrwxrwx 1  139148530 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xaa.tsv.gz
-rwxrwxrwx 1  138086136 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xab.tsv.gz
-rwxrwxrwx 1  146165298 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xac.tsv.gz
-rwxrwxrwx 1  152491197 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xad.tsv.gz
-rwxrwxrwx 1  154673351 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xae.tsv.gz
-rwxrwxrwx 1  155920643 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xaf.tsv.gz
-rwxrwxrwx 1  156468098 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xag.tsv.gz
-rwxrwxrwx 1  157626894 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xah.tsv.gz
-rwxrwxrwx 1  158872953 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xai.tsv.gz
-rwxrwxrwx 1  158108620 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xaj.tsv.gz
-rwxrwxrwx 1  158439002 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xak.tsv.gz
-rwxrwxrwx 1  158618811 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xal.tsv.gz
-rwxrwxrwx 1  159421273 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xam.tsv.gz
-rwxrwxrwx 1  158402981 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xan.tsv.gz
-rwxrwxrwx 1  157375232 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xao.tsv.gz
-rwxrwxrwx 1  158516929 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xap.tsv.gz
-rwxrwxrwx 1  158029022 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xaq.tsv.gz
-rwxrwxrwx 1  159808270 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xar.tsv.gz
-rwxrwxrwx 1  160148777 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xas.tsv.gz
-rwxrwxrwx 1  160844640 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xat.tsv.gz
-rwxrwxrwx 1  161679424 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xau.tsv.gz
-rwxrwxrwx 1  159240120 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xav.tsv.gz
-rwxrwxrwx 1  160124996 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xaw.tsv.gz
-rwxrwxrwx 1  159158447 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xax.tsv.gz
-rwxrwxrwx 1  158436630 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xay.tsv.gz
-rwxrwxrwx 1  158518938 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xaz.tsv.gz
-rwxrwxrwx 1  156520868 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xba.tsv.gz
-rwxrwxrwx 1  154253795 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xbb.tsv.gz
-rwxrwxrwx 1  142244585 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xbc.tsv.gz

Trying to run something as simple as

a = load 's3n://ml-weblogs/smartlinks/daytsvs/day=20111130/' using PigStorage();
s = sample a 0.001;
dump s;

gives

ERROR 2999: Unexpected internal error.
Failed to create DataStorage

java.lang.RuntimeException: Failed to create DataStorage
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
        at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
        at org.apache.pig.PigServer.<init>(PigServer.java:226)
        at org.apache.pig.PigServer.<init>(PigServer.java:215)
        at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
        at org.apache.pig.Main.run(Main.java:452)
        at org.apache.pig.Main.main(Main.java:107)
Caused by: java.io.IOException: Call to /10.116.83.74:9000 failed on local exception: java.io.EOFException
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1142)
        at org.apache.hadoop.ipc.Client.call(Client.java:1110)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
        at $Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:398)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384)
        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:111)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:213)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:180)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1514)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:111)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
        ... 9 more
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:815)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:724)

-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.

________________________________
From: Daniel Dai <[email protected]>
To: [email protected]; Ayon Sinha <[email protected]>
Sent: Friday, December 2, 2011 1:06 AM
Subject: Re: Trying to submit Pig job to Amazon EMR

Pig should support this syntax. Is your s3 data shared publicly? Otherwise, do you have fs.s3.awsAccessKeyId/fs.s3.awsSecretAccessKey defined?

Daniel
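
For s3n:// paths it is the s3n-specific variants of those keys that apply; a minimal core-site.xml sketch, with placeholder values:

  <!-- placeholder credentials; substitute your own -->
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>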

On Thu, Dec 1, 2011 at 4:27 PM, Ayon Sinha <[email protected]> wrote:

> Well, I should not need Pig to connect to HDFS. It should use S3, so I
> changed fs.default.name to s3n://<mybucketname> and now I get the Grunt
> prompt.
>
> The next problem I'm facing is when I say
>
> a = load 's3n://<mydatabucket>/blah/foo/day=20111127' using PigStorage();
>
> I get
>
> 2011-12-01 16:22:01,948 [main] WARN  org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/user%2Fmymapred-user' - Unexpected response code 404, expected 200
> 2011-12-01 16:22:02,024 [main] WARN  org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/user%2Fmymapred-user_%24folder%24' - Unexpected response code 404, expected 200
> 2011-12-01 16:22:02,038 [main] WARN  org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/' - Unexpected response code 404, expected 200
> 2011-12-01 16:22:02,038 [main] WARN  org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/' - Received error response with XML message
> 2011-12-01 16:22:02,045 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6007: Unable to check name s3n://<mybucketname>/user/mymapred-user
>
> What is it trying to check? Does it need some storage to write intermediate
> files to?
>
> -Ayon
> See My Photos on Flickr
> Also check out my Blog for answers to commonly asked questions.
>
> ________________________________
> From: Jonathan Coveney <[email protected]>
> To: [email protected]; Ayon Sinha <[email protected]>
> Sent: Thursday, December 1, 2011 4:17 PM
> Subject: Re: Trying to submit Pig job to Amazon EMR
>
> Usually this means that the version of Hadoop bundled in the pig jar
> mismatches the version of Hadoop you're running. I'd do ant
> jar-withouthadoop and point it at the Hadoop on EC2 using the hadoopless
> pig jar.
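>
> A rough sketch of that, assuming a Pig 0.8.1 source checkout and a local
> install of the same Hadoop version the cluster runs (HADOOP_HOME, the
> hadoop-core jar version, and the script name below are placeholders; the
> built jar's name can also vary by Pig version):
>
>   # build a pig jar that does not bundle Hadoop classes
>   cd pig-0.8.1-src
>   ant jar-withouthadoop
>
>   # run Pig against the cluster's conf dir and its own Hadoop jars
>   export HADOOP_CONF_DIR=/home/mashlogic/ayon/pigconf
>   export HADOOP_HOME=/usr/lib/hadoop   # placeholder path
>   java -cp "pig-withouthadoop.jar:$HADOOP_CONF_DIR:$HADOOP_HOME/hadoop-core-0.20.2.jar:$HADOOP_HOME/lib/*" \
>       org.apache.pig.Main myscript.pig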
>
> 2011/12/1 Ayon Sinha <[email protected]>
>
>> Hi,
>> I have an EC2 box set up with Pig 0.8.1 which can run my jobs fine in
>> local mode. So now I want to configure the NN & JT such that the job goes
>> to the EMR cluster I've spun up.
>> I have a local pigconf directory with the Hadoop XML files, and
>> HADOOP_CONF_DIR and PIG_CLASSPATH are set to point to it.
>>
>> In core-site.xml I have
>>
>>   <property>
>>     <name>fs.default.name</name>
>>     <value>hdfs://10.116.83.74:9000</value>
>>   </property>
>>
>> In mapred-site.xml I have:
>>
>> <configuration>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>10.116.83.74:9001</value>
>>   </property>
>> </configuration>
>>
>> Now Pig tries to connect and I get
>>
>> 2011-12-01 16:10:58,009 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/mashlogic/ayon/pigconf/pig_1322784657959.log
>> 2011-12-01 16:10:58,950 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.116.83.74:9000
>> 2011-12-01 16:10:59,814 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Failed to create DataStorage
>>
>> log file says:
>>
>> Error before Pig is launched
>> ----------------------------
>> ERROR 2999: Unexpected internal error. Failed to create DataStorage
>>
>> java.lang.RuntimeException: Failed to create DataStorage
>>         at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
>>         at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
>>         at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
>>         at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
>>         at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
>>         at org.apache.pig.PigServer.<init>(PigServer.java:226)
>>         at org.apache.pig.PigServer.<init>(PigServer.java:215)
>>         at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
>>         at org.apache.pig.Main.run(Main.java:452)
>>         at org.apache.pig.Main.main(Main.java:107)
>> Caused by: java.io.IOException: Call to /10.116.83.74:9000 failed on local exception: java.io.EOFException
>>         at org.apache.hadoop.ipc.Client.wrapException(Client.java:1142)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1110)
>>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>>         at $Proxy0.getProtocolVersion(Unknown Source)
>>         at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:398)
>>         at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384)
>>         at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:111)
>>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:213)
>>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:180)
>>         at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
>>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1514)
>>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
>>         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
>>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
>>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
>>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:111)
>>         at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
>>         ... 9 more
>> Caused by: java.io.EOFException
>>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
>>         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:815)
>>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:724)
>> ================================================================================
>>
>> My EMR is running Hive jobs just fine. So if I can get it to run my Pig
>> jobs, I'll be happy.
>>
>> -Ayon
>> See My Photos on Flickr
>> Also check out my Blog for answers to commonly asked questions.
