Looks like I'm running into a problem I hadn't seen before. Pig is 0.9.1, and Hadoop is the same version as on EMR. The conf is being picked up, so it connects to the EMR NameNode and JobTracker. Now I get this:
/home/mashlogic/ayon/hadoop-0.20.0
2011-12-05 10:56:58,200 [main] INFO org.apache.pig.Main - Logging error messages to: /home/mashlogic/ayon/pig_1323111418198.log
2011-12-05 10:56:58,398 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: 10.203.6.84:9000
2011-12-05 10:56:58,402 [main] WARN org.apache.hadoop.fs.FileSystem - "10.203.6.84:9000" is a deprecated filesystem name. Use "hdfs://10.203.6.84:9000/" instead.
2011-12-05 10:56:58,531 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.203.6.84:9001
2011-12-05 10:56:58,532 [main] WARN org.apache.hadoop.fs.FileSystem - "10.203.6.84:9000" is a deprecated filesystem name. Use "hdfs://10.203.6.84:9000/" instead.

grunt> a = load 's3n://ml-weblogs/smartlinks/daytsvs/day=20111130' using PigStorage();
2011-12-05 10:57:18,078 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 1, column 4> pig script failed to validate: java.net.URISyntaxException: Illegal character in scheme name at index 0: 10.203.6.84:9000

What is going on here?

-Ayon

________________________________
From: Ayon Sinha <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Friday, December 2, 2011 8:01 PM
Subject: Re: Trying to submit Pig job to Amazon EMR

So with the help of Daniel and Thejas, we figured out the problem. The root cause was a mismatch of Hadoop versions between EMR and the Pig client. When I copied all the Hadoop jars from the EMR box over to the Pig 0.8.1 client EC2 box, it still did not resolve the issue; the reason is that Pig 0.8.1 uses the Hadoop classes bundled inside its own packaged jar. Version 0.9 ships a pig-withouthadoop jar, so we used that. Also, the bin/pig script has a bug that resets HADOOP_HOME; the script was patched to fix this. Finally, Pig will look for a /user/<username> directory in the HDFS of the EMR cluster, so one way is to create that directory in HDFS and then let Pig do its job. I'm not sure why Pig can't create that directory if it doesn't exist; I will investigate that. Thanks to Daniel & Thejas once again.

-Ayon

________________________________
From: Ayon Sinha <[email protected]>
To: Daniel Dai <[email protected]>; "[email protected]" <[email protected]>
Sent: Friday, December 2, 2011 8:15 AM
Subject: Re: Trying to submit Pig job to Amazon EMR

Yes, I do have the awsSecretAccessKey defined, and it is correct, I believe.
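For reference, my core-site.xml carries roughly these entries (actual key values elided; note that the fs.s3n.* properties are the ones the s3n:// scheme reads, while fs.s3.* covers the s3:// block-storage scheme):

  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>...</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>...</value>
  </property>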
To test:

mashlogic@cruncher ~ [ 8:07AM] hadoop dfs -ls s3n://ml-weblogs/smartlinks/daytsvs/day=20111130/
Found 29 items
-rwxrwxrwx 1 139148530 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xaa.tsv.gz
-rwxrwxrwx 1 138086136 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xab.tsv.gz
-rwxrwxrwx 1 146165298 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xac.tsv.gz
-rwxrwxrwx 1 152491197 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xad.tsv.gz
-rwxrwxrwx 1 154673351 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xae.tsv.gz
-rwxrwxrwx 1 155920643 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xaf.tsv.gz
-rwxrwxrwx 1 156468098 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xag.tsv.gz
-rwxrwxrwx 1 157626894 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xah.tsv.gz
-rwxrwxrwx 1 158872953 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xai.tsv.gz
-rwxrwxrwx 1 158108620 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xaj.tsv.gz
-rwxrwxrwx 1 158439002 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xak.tsv.gz
-rwxrwxrwx 1 158618811 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xal.tsv.gz
-rwxrwxrwx 1 159421273 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xam.tsv.gz
-rwxrwxrwx 1 158402981 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xan.tsv.gz
-rwxrwxrwx 1 157375232 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xao.tsv.gz
-rwxrwxrwx 1 158516929 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xap.tsv.gz
-rwxrwxrwx 1 158029022 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xaq.tsv.gz
-rwxrwxrwx 1 159808270 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xar.tsv.gz
-rwxrwxrwx 1 160148777 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xas.tsv.gz
-rwxrwxrwx 1 160844640 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xat.tsv.gz
-rwxrwxrwx 1 161679424 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xau.tsv.gz
-rwxrwxrwx 1 159240120 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xav.tsv.gz
-rwxrwxrwx 1 160124996 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xaw.tsv.gz
-rwxrwxrwx 1 159158447 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xax.tsv.gz
-rwxrwxrwx 1 158436630 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xay.tsv.gz
-rwxrwxrwx 1 158518938 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xaz.tsv.gz
-rwxrwxrwx 1 156520868 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xba.tsv.gz
-rwxrwxrwx 1 154253795 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xbb.tsv.gz
-rwxrwxrwx 1 142244585 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xbc.tsv.gz

Trying to run something as simple as

a = load 's3n://ml-weblogs/smartlinks/daytsvs/day=20111130/' using PigStorage();
s = sample a 0.001;
dump s;

gives

ERROR 2999: Unexpected internal error.
Failed to create DataStorage

java.lang.RuntimeException: Failed to create DataStorage
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
at org.apache.pig.PigServer.<init>(PigServer.java:226)
at org.apache.pig.PigServer.<init>(PigServer.java:215)
at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
at org.apache.pig.Main.run(Main.java:452)
at org.apache.pig.Main.main(Main.java:107)
Caused by: java.io.IOException: Call to /10.116.83.74:9000 failed on local exception: java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1142)
at org.apache.hadoop.ipc.Client.call(Client.java:1110)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy0.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:398)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:111)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:213)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:180)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1514)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:111)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
... 9 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:815)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:724)

-Ayon

________________________________
From: Daniel Dai <[email protected]>
To: [email protected]; Ayon Sinha <[email protected]>
Sent: Friday, December 2, 2011 1:06 AM
Subject: Re: Trying to submit Pig job to Amazon EMR

Pig should support this syntax. Is your S3 data shared publicly? If not, do you have fs.s3.awsAccessKeyId/fs.s3.awsSecretAccessKey defined?

Daniel

On Thu, Dec 1, 2011 at 4:27 PM, Ayon Sinha <[email protected]> wrote:

Well, I should not need Pig to connect to HDFS. It should use S3, so I changed fs.default.name to s3n://<mybucketname> and now I get the Grunt prompt.
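In other words, core-site.xml now has roughly this entry (the bucket name is a placeholder):

  <property>
    <name>fs.default.name</name>
    <value>s3n://<mybucketname></value>
  </property>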
The next problem I'm facing is when I say

a = load 's3n://<mydatabucket>/blah/foo/day=20111127' using PigStorage();

I get

2011-12-01 16:22:01,948 [main] WARN org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/user%2Fmymapred-user' - Unexpected response code 404, expected 200
2011-12-01 16:22:02,024 [main] WARN org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/user%2Fmymapred-user_%24folder%24' - Unexpected response code 404, expected 200
2011-12-01 16:22:02,038 [main] WARN org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/' - Unexpected response code 404, expected 200
2011-12-01 16:22:02,038 [main] WARN org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/' - Received error response with XML message
2011-12-01 16:22:02,045 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6007: Unable to check name s3n://<mybucketname>/user/mymapred-user

What is it trying to check? Does it need some storage to write intermediate files to?

-Ayon

________________________________
From: Jonathan Coveney <[email protected]>
To: [email protected]; Ayon Sinha <[email protected]>
Sent: Thursday, December 1, 2011 4:17 PM
Subject: Re: Trying to submit Pig job to Amazon EMR

Usually this means that the version of Hadoop bundled in Pig mismatches the version of Hadoop you're running. I'd do "ant jar-withouthadoop" and point the hadoopless Pig jar at the Hadoop on EC2.

2011/12/1 Ayon Sinha <[email protected]>:

Hi,
I have an EC2 box set up with Pig 0.8.1, which can run my jobs fine in local mode. Now I want to configure the NameNode & JobTracker such that the job goes to the EMR cluster I've spun up. I have a local pigconf directory with the Hadoop XML files, and HADOOP_CONF_DIR and PIG_CLASSPATH are set to point to it.

In core-site.xml I have:

  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.116.83.74:9000</value>
  </property>

In mapred-site.xml I have:

  <property>
    <name>mapred.job.tracker</name>
    <value>10.116.83.74:9001</value>
  </property>

Now Pig tries to connect and I get:

2011-12-01 16:10:58,009 [main] INFO org.apache.pig.Main - Logging error messages to: /home/mashlogic/ayon/pigconf/pig_1322784657959.log
2011-12-01 16:10:58,950 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.116.83.74:9000
2011-12-01 16:10:59,814 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Failed to create DataStorage

The log file says:

Error before Pig is launched
----------------------------
ERROR 2999: Unexpected internal error.
Failed to create DataStorage

java.lang.RuntimeException: Failed to create DataStorage
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
at org.apache.pig.PigServer.<init>(PigServer.java:226)
at org.apache.pig.PigServer.<init>(PigServer.java:215)
at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
at org.apache.pig.Main.run(Main.java:452)
at org.apache.pig.Main.main(Main.java:107)
Caused by: java.io.IOException: Call to /10.116.83.74:9000 failed on local exception: java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1142)
at org.apache.hadoop.ipc.Client.call(Client.java:1110)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy0.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:398)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:111)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:213)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:180)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1514)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:111)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
... 9 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:815)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:724)
================================================================================

My EMR cluster is running Hive jobs just fine, so if I can get it to run my Pig jobs, I'll be happy.

-Ayon
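To summarize the resolution reached upthread, a rough sketch of the client-side setup that eventually worked (paths and the Pig 0.9.1 checkout layout are illustrative; the essentials called out above are the pig-withouthadoop jar, a HADOOP_HOME matching the EMR cluster's Hadoop version, and the /user/<username> directory in the cluster's HDFS):

  # Build the hadoopless Pig jar from a Pig 0.9.x source checkout
  # (per Jonathan's "ant jar-withouthadoop" suggestion)
  cd pig-0.9.1
  ant jar-withouthadoop

  # Point Pig at a Hadoop install matching the EMR cluster's version, with
  # HADOOP_CONF_DIR holding the core-site.xml/mapred-site.xml shown above
  export HADOOP_HOME=/home/mashlogic/ayon/hadoop-0.20.0   # illustrative path
  export HADOOP_CONF_DIR=/home/mashlogic/ayon/pigconf
  export PIG_CLASSPATH=$HADOOP_CONF_DIR

  # Pre-create the /user/<username> directory Pig expects on the cluster's HDFS
  $HADOOP_HOME/bin/hadoop fs -mkdir /user/mashlogic

  # Launch via the patched bin/pig (the stock script resets HADOOP_HOME)
  bin/pig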
