Hi folks,
I'm not sure whether this is a Pig issue or a Hadoop issue in general, but
I'm posting it here since I'm running Pig scripts :)
Anyway, I've been trying to perform a CROSS join between 2 files, which
produces ~1 billion records. My Hadoop cluster has 4 data nodes, and the
namenode also serves as one of them (not recommended, but I haven't had
time to reconfigure this yet :P). When I ran the Pig script, the job
threw the following exception once it was 80+% complete:
java.io.IOException: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/user/root/out/_temporary/_attempt_201201091651_0001_r_000001_3/part-r-00001
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1517)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:685)
    at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:464)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:427)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:399)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:261)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
The Pig script is shown below:
============================================================
set job.name 'vac cross 2';
set default_parallel 10;
register lib/*.jar;
define DIST com.pig.udf.Distance();
js = load 'js.csv' using PigStorage(',') as (ic:chararray, jsstate:chararray);
vac = load 'vac.csv' using PigStorage(',') as (id:chararray, vacstate:chararray);
cx = cross js, vac;
d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate, vacstate);
store d into 'out' using PigStorage(',');
============================================================
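A scaled-down run of the same script might help show whether the failure depends on the output volume. This is only an untested sketch: the LIMIT of 1000 rows per input, the aliases js_small and vac_small, and the 'out_sample' output path are assumptions for illustration and were not part of the failing run.

```pig
-- Scaled-down variant of the script above (hypothetical, for comparison only).
set job.name 'vac cross 2 sample';
set default_parallel 10;
register lib/*.jar;
define DIST com.pig.udf.Distance();

js = load 'js.csv' using PigStorage(',') as (ic:chararray, jsstate:chararray);
vac = load 'vac.csv' using PigStorage(',') as (id:chararray, vacstate:chararray);

-- Assumed sample size: 1000 x 1000 rows gives at most 1,000,000 pairs.
js_small = limit js 1000;
vac_small = limit vac 1000;

cx = cross js_small, vac_small;
d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate, vacstate);

-- 'out_sample' is a hypothetical path, chosen so the original 'out' is not touched.
store d into 'out_sample' using PigStorage(',');
```

If the sample completes while the full run still fails, that would suggest the problem is tied to how much data the reducers write to HDFS rather than to the script logic itself.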
Any help is greatly appreciated.
Thanks!