So with the help of Daniel and Thejas, we figured out the problem. The root
cause was a mismatch of Hadoop versions between EMR and the Pig client.
Copying all the Hadoop jars from the EMR box to the EC2 box running the Pig
0.8.1 client still did not resolve the issue, because Pig 0.8.1 uses the
Hadoop classes bundled inside its own packaged jar. Version 0.9 ships a
pig-withouthadoop jar, so we used that.

Also, the bin/pig script has a bug where it resets HADOOP_HOME; we patched
the script to fix this.
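For anyone retracing this, here is a rough sketch of what the working setup
amounts to (the paths and the exact Pig version below are placeholders, not
the ones we used):

export HADOOP_HOME=/opt/hadoop-matching-emr        # a Hadoop install whose version matches the EMR cluster
export HADOOP_CONF_DIR=/home/mashlogic/ayon/pigconf
export PIG_CLASSPATH=/opt/pig-0.9.1/pig-withouthadoop.jar:$HADOOP_CONF_DIR
/opt/pig-0.9.1/bin/pig                             # the patched bin/pig, so HADOOP_HOME survives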

Pig also looks for a /user/<username> directory in the HDFS of the EMR
cluster. One way around this is to create the directory in HDFS up front and
then let Pig do its job. I'm not sure why Pig can't create that directory
itself if it doesn't exist; I will investigate that.
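Creating it is a one-liner from any box that can reach the cluster's HDFS
(substitute whichever user submits the jobs):

hadoop fs -mkdir /user/<username>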

Thanks to Daniel & Thejas once again.
 
-Ayon



________________________________
 From: Ayon Sinha <[email protected]>
To: Daniel Dai <[email protected]>; "[email protected]" 
<[email protected]> 
Sent: Friday, December 2, 2011 8:15 AM
Subject: Re: Trying to submit Pig job to Amazon EMR
 
Yes, I do have the awsSecretAccessKey defined, and correctly, I believe.
To test:

mashlogic@cruncher ~ [ 8:07AM] hadoop dfs -ls s3n://ml-weblogs/smartlinks/daytsvs/day=20111130/
Found 29 items
-rwxrwxrwx   1  139148530 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xaa.tsv.gz
-rwxrwxrwx   1  138086136 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xab.tsv.gz
-rwxrwxrwx   1  146165298 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xac.tsv.gz
-rwxrwxrwx   1  152491197 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xad.tsv.gz
-rwxrwxrwx   1  154673351 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xae.tsv.gz
-rwxrwxrwx   1  155920643 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xaf.tsv.gz
-rwxrwxrwx   1  156468098 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xag.tsv.gz
-rwxrwxrwx   1  157626894 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xah.tsv.gz
-rwxrwxrwx   1  158872953 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xai.tsv.gz
-rwxrwxrwx   1  158108620 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xaj.tsv.gz
-rwxrwxrwx   1  158439002 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xak.tsv.gz
-rwxrwxrwx   1  158618811 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xal.tsv.gz
-rwxrwxrwx   1  159421273 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xam.tsv.gz
-rwxrwxrwx   1  158402981 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xan.tsv.gz
-rwxrwxrwx   1  157375232 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xao.tsv.gz
-rwxrwxrwx   1  158516929 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xap.tsv.gz
-rwxrwxrwx   1  158029022 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xaq.tsv.gz
-rwxrwxrwx   1  159808270 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xar.tsv.gz
-rwxrwxrwx   1  160148777 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xas.tsv.gz
-rwxrwxrwx   1  160844640 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xat.tsv.gz
-rwxrwxrwx   1  161679424 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xau.tsv.gz
-rwxrwxrwx   1  159240120 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xav.tsv.gz
-rwxrwxrwx   1  160124996 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xaw.tsv.gz
-rwxrwxrwx   1  159158447 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xax.tsv.gz
-rwxrwxrwx   1  158436630 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xay.tsv.gz
-rwxrwxrwx   1  158518938 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xaz.tsv.gz
-rwxrwxrwx   1  156520868 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xba.tsv.gz
-rwxrwxrwx   1  154253795 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xbb.tsv.gz
-rwxrwxrwx   1  142244585 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xbc.tsv.gz

 
Trying to run something as simple as:
a = load 's3n://ml-weblogs/smartlinks/daytsvs/day=20111130/' using PigStorage();
s = sample a 0.001;
dump s;

gives:

>ERROR 2999: Unexpected internal error. Failed to create DataStorage
>
>java.lang.RuntimeException: Failed to create DataStorage
>at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
>at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
>at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
>at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
>at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
>at org.apache.pig.PigServer.<init>(PigServer.java:226)
>at org.apache.pig.PigServer.<init>(PigServer.java:215)
>at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
>at org.apache.pig.Main.run(Main.java:452)
>at org.apache.pig.Main.main(Main.java:107)
>Caused by: java.io.IOException: Call to /10.116.83.74:9000 failed on local exception: java.io.EOFException
>at org.apache.hadoop.ipc.Client.wrapException(Client.java:1142)
>at org.apache.hadoop.ipc.Client.call(Client.java:1110)
>at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>at $Proxy0.getProtocolVersion(Unknown Source)
>at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:398)
>at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384)
>at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:111)
>at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:213)
>at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:180)
>at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
>at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1514)
>at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
>at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
>at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
>at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
>at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:111)
>at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
>... 9 more
>Caused by: java.io.EOFException
>at java.io.DataInputStream.readInt(DataInputStream.java:375)
>at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:815)
>at org.apache.hadoop.ipc.Client$Connection.run(Client.java:724)


-Ayon



________________________________
From: Daniel Dai <[email protected]>
To: [email protected]; Ayon Sinha <[email protected]> 
Sent: Friday, December 2, 2011 1:06 AM
Subject: Re: Trying to submit Pig job to Amazon EMR


Pig should support this syntax. Is your S3 data shared publicly? If not, do
you have fs.s3.awsAccessKeyId/fs.s3.awsSecretAccessKey defined?
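A minimal sketch of how those keys are typically defined in core-site.xml
(placeholder values; for s3n:// URIs it is the fs.s3n.* variants of the
property names that are consulted):

  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>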

Daniel


On Thu, Dec 1, 2011 at 4:27 PM, Ayon Sinha <[email protected]> wrote:

>Well, I should not need Pig to connect to HDFS. It should use S3, so I changed fs.default.name to s3n://<mybucketname> and now I get the Grunt prompt.
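>Concretely, that change is just this core-site.xml property (bucket name kept as a placeholder):
>
>  <property>
>    <name>fs.default.name</name>
>    <value>s3n://<mybucketname></value>
>  </property>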
>
>The next problem I'm facing is when I say,
>a = load 's3n://<mydatabucket>/blah/foo/day=20111127' using PigStorage();
>
>
>I get:
>
>2011-12-01 16:22:01,948 [main] WARN  org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/user%2Fmymapred-user' - Unexpected response code 404, expected 200
>2011-12-01 16:22:02,024 [main] WARN  org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/user%2Fmymapred-user_%24folder%24' - Unexpected response code 404, expected 200
>2011-12-01 16:22:02,038 [main] WARN  org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/' - Unexpected response code 404, expected 200
>2011-12-01 16:22:02,038 [main] WARN  org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/' - Received error response with XML message
>2011-12-01 16:22:02,045 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6007: Unable to check name s3n://<mybucketname>/user/mymapred-user
>
>
>What is it trying to check? Does it need some storage to write intermediate 
>files to?
>
> 
>-Ayon
>
>
>
>
>________________________________
> From: Jonathan Coveney <[email protected]>
>To: [email protected]; Ayon Sinha <[email protected]>
>Sent: Thursday, December 1, 2011 4:17 PM
>Subject: Re: Trying to submit Pig job to Amazon EMR
>
>
>
>Usually this means that the version of Hadoop bundled in Pig doesn't match the version of Hadoop you're running. I'd do "ant jar-withouthadoop" and point the hadoopless Pig jar at the Hadoop on EC2.
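>(A sketch of that, run from a Pig source checkout; the exact name of the output jar depends on the version:)
>
>ant jar-withouthadoop    # builds something like build/pig-x.y.z-withouthadoop.jar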
>
>
>2011/12/1 Ayon Sinha <[email protected]>
>
>Hi,
>>I have an EC2 box set up with Pig 0.8.1 which can run my jobs fine in local mode. Now I want to configure the NN & JT so that the job goes to the EMR cluster I've spun up.
>>I have a local pigconf directory with the Hadoop XML files, and HADOOP_CONF_DIR and PIG_CLASSPATH are set to it.
>>
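>>A sketch of that environment setup (the conf path is the one from the log below):
>>
>>export HADOOP_CONF_DIR=/home/mashlogic/ayon/pigconf
>>export PIG_CLASSPATH=$HADOOP_CONF_DIR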
>>In core-site.xml I have:
>>
>> <property>
>>    <name>fs.default.name</name>
>>    <value>hdfs://10.116.83.74:9000</value>
>>  </property>
>>
>>
>>In mapred-site.xml I have:
>>
>>  <property>
>>    <name>mapred.job.tracker</name>
>>    <value>10.116.83.74:9001</value>
>>  </property>
>>
>>
>>Now Pig tries to connect and I get:
>>2011-12-01 16:10:58,009 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/mashlogic/ayon/pigconf/pig_1322784657959.log
>>2011-12-01 16:10:58,950 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.116.83.74:9000
>>2011-12-01 16:10:59,814 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Failed to create DataStorage
>>
>>
>>log file says:
>>
>>Error before Pig is launched
>>----------------------------
>>ERROR 2999: Unexpected internal error. Failed to create DataStorage
>>
>>java.lang.RuntimeException: Failed to create DataStorage
>>at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
>>at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
>>at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
>>at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
>>at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
>>at org.apache.pig.PigServer.<init>(PigServer.java:226)
>>at org.apache.pig.PigServer.<init>(PigServer.java:215)
>>at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
>>at org.apache.pig.Main.run(Main.java:452)
>>at org.apache.pig.Main.main(Main.java:107)
>>Caused by: java.io.IOException: Call to /10.116.83.74:9000 failed on local exception: java.io.EOFException
>>at org.apache.hadoop.ipc.Client.wrapException(Client.java:1142)
>>at org.apache.hadoop.ipc.Client.call(Client.java:1110)
>>at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>>at $Proxy0.getProtocolVersion(Unknown Source)
>>at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:398)
>>at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384)
>>at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:111)
>>at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:213)
>>at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:180)
>>at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
>>at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1514)
>>at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
>>at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
>>at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
>>at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
>>at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:111)
>>at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
>>... 9 more
>>Caused by: java.io.EOFException
>>at java.io.DataInputStream.readInt(DataInputStream.java:375)
>>at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:815)
>>at org.apache.hadoop.ipc.Client$Connection.run(Client.java:724)
>>================================================================================
>>
>>My EMR is running Hive jobs just fine. So if I can get it to run my Pig jobs, 
>>I'll be happy.
>> 
>>-Ayon
>>
