Can you send the entire stack trace from the Pig logs?
-Thejas


On 12/5/11 11:08 AM, Ayon Sinha wrote:
Looks like I'm running into a problem I hadn't seen before.
Pig is 0.9.1. Hadoop is the same version as on EMR. The conf is being
picked up, so it connects to the EMR NN and JT. Now I get this:

/home/mashlogic/ayon/hadoop-0.20.0
2011-12-05 10:56:58,200 [main] INFO org.apache.pig.Main - Logging error messages to: /home/mashlogic/ayon/pig_1323111418198.log
2011-12-05 10:56:58,398 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: 10.203.6.84:9000
2011-12-05 10:56:58,402 [main] WARN org.apache.hadoop.fs.FileSystem - "10.203.6.84:9000" is a deprecated filesystem name. Use "hdfs://10.203.6.84:9000/" instead.
2011-12-05 10:56:58,531 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.203.6.84:9001
2011-12-05 10:56:58,532 [main] WARN org.apache.hadoop.fs.FileSystem - "10.203.6.84:9000" is a deprecated filesystem name. Use "hdfs://10.203.6.84:9000/" instead.
grunt> a = load 's3n://ml-weblogs/smartlinks/daytsvs/day=20111130' using PigStorage();
2011-12-05 10:57:18,078 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 1, column 4> pig script failed to validate: java.net.URISyntaxException: Illegal character in scheme name at index 0: 10.203.6.84:9000

What is going on here?
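
My guess, re-reading the warnings above: fs.default.name in my conf is the bare host:port, which the URI parser then rejects. Scheme-qualifying it in core-site.xml, roughly as below, should clear the URISyntaxException:

 <property>
   <name>fs.default.name</name>
   <value>hdfs://10.203.6.84:9000</value>
 </property>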
-Ayon
See My Photos on Flickr <http://www.flickr.com/photos/ayonsinha/>
Also check out my Blog for answers to commonly asked questions.
<http://dailyadvisor.blogspot.com>

------------------------------------------------------------------------
From: Ayon Sinha <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Friday, December 2, 2011 8:01 PM
Subject: Re: Trying to submit Pig job to Amazon EMR

So with the help of Daniel and Thejas, we figured out the problem. The
root cause was a mismatch of Hadoop versions between EMR and the Pig
client. When I copied all the Hadoop jars from the EMR box over to the
Pig 0.8.1 client box on EC2, it still did not resolve the issue, because
Pig 0.8.1 uses the Hadoop classes bundled inside its own packaged jar.
Pig 0.9 ships a pig-withouthadoop jar, so we used that instead.
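
For anyone retracing this, roughly the invocation we ended up with; the jar names and paths below are from memory, so treat them as assumptions:

# run Pig 0.9's withouthadoop jar against the Hadoop jars copied from the EMR master
export HADOOP_HOME=/home/mashlogic/ayon/hadoop-0.20.0   # location of the copied jars (assumed)
java -cp pig-0.9.1-withouthadoop.jar:$HADOOP_HOME/hadoop-0.20.0-core.jar:$HADOOP_HOME/conf \
  org.apache.pig.Main    # the cluster's lib/ jars may need to be appended as well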

Also, the bin/pig script has a bug that resets HADOOP_HOME; we patched
the script to fix this as well.
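
The gist of the patch, sketched from memory (the stock script clobbered whatever HADOOP_HOME the caller had exported):

# in bin/pig: only default HADOOP_HOME when the caller hasn't set one
if [ -z "$HADOOP_HOME" ]; then
  HADOOP_HOME=/usr/lib/hadoop   # hypothetical fallback; ours pointed at the copied EMR jars
fi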

Pig also looks for a /user/<username> directory in the HDFS of the EMR
cluster. One way around this is to create the directory in HDFS yourself
and then let Pig do its job. I'm not sure why Pig can't create that
directory if it doesn't exist; I will investigate that.
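
Creating it by hand is just the usual mkdir against the EMR HDFS, assuming the conf already points at the cluster:

hadoop fs -mkdir /user/$(whoami)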

Thanks to Daniel & Thejas once again.

-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.



________________________________
From: Ayon Sinha <[email protected]>
To: Daniel Dai <[email protected]>; "[email protected]" <[email protected]>
Sent: Friday, December 2, 2011 8:15 AM
Subject: Re: Trying to submit Pig job to Amazon EMR

Yes, I do have the awsSecretAccessKey defined, and it is correct, I believe.
To test:

mashlogic@cruncher ~ [ 8:07AM] hadoop dfs -ls s3n://ml-weblogs/smartlinks/daytsvs/day=20111130/
Found 29 items
-rwxrwxrwx 1 139148530 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xaa.tsv.gz
-rwxrwxrwx 1 138086136 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xab.tsv.gz
-rwxrwxrwx 1 146165298 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xac.tsv.gz
-rwxrwxrwx 1 152491197 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xad.tsv.gz
-rwxrwxrwx 1 154673351 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xae.tsv.gz
-rwxrwxrwx 1 155920643 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xaf.tsv.gz
-rwxrwxrwx 1 156468098 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xag.tsv.gz
-rwxrwxrwx 1 157626894 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xah.tsv.gz
-rwxrwxrwx 1 158872953 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xai.tsv.gz
-rwxrwxrwx 1 158108620 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xaj.tsv.gz
-rwxrwxrwx 1 158439002 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xak.tsv.gz
-rwxrwxrwx 1 158618811 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xal.tsv.gz
-rwxrwxrwx 1 159421273 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xam.tsv.gz
-rwxrwxrwx 1 158402981 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xan.tsv.gz
-rwxrwxrwx 1 157375232 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xao.tsv.gz
-rwxrwxrwx 1 158516929 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xap.tsv.gz
-rwxrwxrwx 1 158029022 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xaq.tsv.gz
-rwxrwxrwx 1 159808270 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xar.tsv.gz
-rwxrwxrwx 1 160148777 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xas.tsv.gz
-rwxrwxrwx 1 160844640 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xat.tsv.gz
-rwxrwxrwx 1 161679424 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xau.tsv.gz
-rwxrwxrwx 1 159240120 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xav.tsv.gz
-rwxrwxrwx 1 160124996 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xaw.tsv.gz
-rwxrwxrwx 1 159158447 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xax.tsv.gz
-rwxrwxrwx 1 158436630 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xay.tsv.gz
-rwxrwxrwx 1 158518938 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xaz.tsv.gz
-rwxrwxrwx 1 156520868 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xba.tsv.gz
-rwxrwxrwx 1 154253795 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xbb.tsv.gz
-rwxrwxrwx 1 142244585 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xbc.tsv.gz


Trying to run something as simple as
a = load 's3n://ml-weblogs/smartlinks/daytsvs/day=20111130/' using PigStorage();
s = sample a 0.001;
dump s;

gives
 >ERROR 2999: Unexpected internal error. Failed to create DataStorage
 >
 >java.lang.RuntimeException: Failed to create DataStorage
 >at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
 >at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
 >at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
 >at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
 >at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
 >at org.apache.pig.PigServer.<init>(PigServer.java:226)
 >at org.apache.pig.PigServer.<init>(PigServer.java:215)
 >at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
 >at org.apache.pig.Main.run(Main.java:452)
 >at org.apache.pig.Main.main(Main.java:107)
 >Caused by: java.io.IOException: Call to /10.116.83.74:9000 failed on local exception: java.io.EOFException
 >at org.apache.hadoop.ipc.Client.wrapException(Client.java:1142)
 >at org.apache.hadoop.ipc.Client.call(Client.java:1110)
 >at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
 >at $Proxy0.getProtocolVersion(Unknown Source)
 >at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:398)
 >at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384)
 >at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:111)
 >at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:213)
 >at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:180)
 >at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
 >at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1514)
 >at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
 >at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
 >at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
 >at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
 >at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:111)
 >at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
 >... 9 more
 >Caused by: java.io.EOFException
 >at java.io.DataInputStream.readInt(DataInputStream.java:375)
 >at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:815)
 >at org.apache.hadoop.ipc.Client$Connection.run(Client.java:724)
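
Side note for the archives: an EOFException out of Client$Connection.receiveResponse while calling getProtocolVersion is the classic symptom of an RPC version mismatch between the client's Hadoop jars and the NameNode. A quick way to confirm, assuming SSH access to the EMR master; the user and hostname below are placeholders:

hadoop version                            # on the EC2 Pig client
ssh hadoop@<emr-master> hadoop version    # on the EMR master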


-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.



________________________________
From: Daniel Dai <[email protected]>
To: [email protected]; Ayon Sinha <[email protected]>
Sent: Friday, December 2, 2011 1:06 AM
Subject: Re: Trying to submit Pig job to Amazon EMR


Pig should support this syntax. Is your S3 data shared publicly? If not,
do you have fs.s3.awsAccessKeyId/fs.s3.awsSecretAccessKey defined?
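
They go in core-site.xml; note that s3n:// URIs read the fs.s3n.* variants rather than fs.s3.*. A sketch with placeholder values:

 <property>
   <name>fs.s3n.awsAccessKeyId</name>
   <value>YOUR_ACCESS_KEY_ID</value>
 </property>
 <property>
   <name>fs.s3n.awsSecretAccessKey</name>
   <value>YOUR_SECRET_KEY</value>
 </property>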

Daniel


On Thu, Dec 1, 2011 at 4:27 PM, Ayon Sinha <[email protected]> wrote:

 >Well, I should not need Pig to connect to HDFS. It should use S3, so I
 >changed fs.default.name to s3n://<mybucketname> and now I get the Grunt prompt.
 >
 >The next problem I'm facing is when I say,
 >a = load 's3n://<mydatabucket>/blah/foo/day=20111127' using PigStorage();
 >
 >
 >I get
 >
 >2011-12-01 16:22:01,948 [main] WARN org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/user%2Fmymapred-user' - Unexpected response code 404, expected 200
 >2011-12-01 16:22:02,024 [main] WARN org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/user%2Fmymapred-user_%24folder%24' - Unexpected response code 404, expected 200
 >2011-12-01 16:22:02,038 [main] WARN org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/' - Unexpected response code 404, expected 200
 >2011-12-01 16:22:02,038 [main] WARN org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/' - Received error response with XML message
 >2011-12-01 16:22:02,045 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6007: Unable to check name s3n://<mybucketname>/user/mymapred-user
 >
 >
 >What is it trying to check? Does it need some storage to write
intermediate files to?
 >
 >
 >-Ayon
 >See My Photos on Flickr
 >Also check out my Blog for answers to commonly asked questions.
 >
 >
 >
 >
 >________________________________
 >From: Jonathan Coveney <[email protected]>
 >To: [email protected]; Ayon Sinha <[email protected]>
 >Sent: Thursday, December 1, 2011 4:17 PM
 >Subject: Re: Trying to submit Pig job to Amazon EMR
 >
 >
 >
 >Usually this means that the version of Hadoop in Pig mismatches the
 >version of Hadoop you're running. I'd do ant jar-withouthadoop and
 >point it at the Hadoop on EC2 using the hadoopless Pig jar.
 >
 >
 >2011/12/1 Ayon Sinha <[email protected] <mailto:[email protected]>>
 >
 >Hi,
 >>I have an EC2 box set up with Pig 0.8.1 which can run my jobs fine in
 >>local mode. So now I want to configure the NN & JT such that the job
 >>goes to the EMR cluster I've spun up.
 >>I have a local pigconf directory with the Hadoop XML files, and
 >>HADOOP_CONF_DIR and PIG_CLASSPATH are both set to point to it.
 >>
 >>In core-site.xml I have:
 >>
 >> <property>
 >> <name>fs.default.name</name>
 >> <value>hdfs://10.116.83.74:9000</value>
 >> </property>
 >>
 >>
 >>In mapred-site.xml I have:
 >><configuration>
 >> <property>
 >> <name>mapred.job.tracker</name>
 >> <value>10.116.83.74:9001</value>
 >> </property>
 >>
 >>
 >>Now Pig tries to connect and I get
 >>2011-12-01 16:10:58,009 [main] INFO org.apache.pig.Main - Logging error messages to: /home/mashlogic/ayon/pigconf/pig_1322784657959.log
 >>2011-12-01 16:10:58,950 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.116.83.74:9000
 >>2011-12-01 16:10:59,814 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Failed to create DataStorage
 >>
 >>
 >>log file says:
 >>
 >>Error before Pig is launched
 >>----------------------------
 >>ERROR 2999: Unexpected internal error. Failed to create DataStorage
 >>
 >>java.lang.RuntimeException: Failed to create DataStorage
 >>at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
 >>at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
 >>at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
 >>at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
 >>at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
 >>at org.apache.pig.PigServer.<init>(PigServer.java:226)
 >>at org.apache.pig.PigServer.<init>(PigServer.java:215)
 >>at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
 >>at org.apache.pig.Main.run(Main.java:452)
 >>at org.apache.pig.Main.main(Main.java:107)
 >>Caused by: java.io.IOException: Call to /10.116.83.74:9000 failed on local exception: java.io.EOFException
 >>at org.apache.hadoop.ipc.Client.wrapException(Client.java:1142)
 >>at org.apache.hadoop.ipc.Client.call(Client.java:1110)
 >>at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
 >>at $Proxy0.getProtocolVersion(Unknown Source)
 >>at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:398)
 >>at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384)
 >>at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:111)
 >>at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:213)
 >>at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:180)
 >>at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
 >>at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1514)
 >>at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
 >>at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
 >>at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
 >>at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
 >>at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:111)
 >>at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
 >>... 9 more
 >>Caused by: java.io.EOFException
 >>at java.io.DataInputStream.readInt(DataInputStream.java:375)
 >>at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:815)
 >>at org.apache.hadoop.ipc.Client$Connection.run(Client.java:724)
 
>>================================================================================
 >>
 >>My EMR is running Hive jobs just fine. So if I can get it to run my
Pig jobs, I'll be happy.
 >>
 >>-Ayon
 >>See My Photos on Flickr
 >>Also check out my Blog for answers to commonly asked questions.
 >>

