I'm running into an issue with Pig 0.9.1. My top-level data directory
contains several files and directories with restricted permissions, and
my LoadFunc and its input format skip those directories when the user
does not have permission to read them. Unfortunately, Pig's execution
engine still throws an exception while setting up the job.
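For context, the skipping happens at split-listing time in my input
format. A minimal sketch of the idea (class and method names here are
illustrative, not the actual code from my jar):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.security.AccessControlException;

// Illustrative only: lists each input path and silently drops any
// directory the current user has no permission to read.
public class SkipUnreadableInputFormat extends TextInputFormat {
    @Override
    protected List<FileStatus> listStatus(JobContext job) throws IOException {
        List<FileStatus> files = new ArrayList<FileStatus>();
        for (Path input : getInputPaths(job)) {
            FileSystem fs = input.getFileSystem(job.getConfiguration());
            addReadable(fs, fs.getFileStatus(input), files);
        }
        return files;
    }

    // Recurse into directories, collecting files; an unreadable
    // directory is dropped rather than aborting the whole listing.
    // (Simplified: no hidden-file filtering.)
    private void addReadable(FileSystem fs, FileStatus status,
            List<FileStatus> out) throws IOException {
        if (!status.isDir()) {
            out.add(status);
            return;
        }
        FileStatus[] children;
        try {
            children = fs.listStatus(status.getPath());
        } catch (AccessControlException e) {
            return; // no READ_EXECUTE here (e.g. /data/secure): skip it
        }
        for (FileStatus child : children) {
            addReadable(fs, child, out);
        }
    }
}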
Example:
$ hadoop fs -ls /data
Found 2 items
drwxr-xr-x - owner users 0 2011-11-16 06:47 /data/readable
drwxr-x--- - owner secure 0 2011-11-16 06:48 /data/secure
The /data/secure directory is readable only by users in the 'secure'
group. Users outside that group hit the following Pig exception even
though the loader and input format never touch the secure data:
REGISTER my-jar;
data = LOAD '/data' USING myLoader();
-- (do something ...)
Caused by: org.apache.hadoop.security.AccessControlException:
org.apache.hadoop.security.AccessControlException: Permission denied:
user=<removed>, access=READ_EXECUTE, inode="secure":owner:secure:rwxr-x---
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
    at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:669)
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:280)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getPathLength(JobControlCompiler.java:791)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getPathLength(JobControlCompiler.java:794)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getTotalInputFileSize(JobControlCompiler.java:779)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.estimateNumberOfReducers(JobControlCompiler.java:739)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:587)
    ... 12 more
I think Pig should catch this exception in JobControlCompiler and
ignore unreadable directories when estimating the number of reducers;
the total input size is only used to pick a reducer count, so skipping
data the user cannot read seems safe.
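Concretely, I'd expect the shape of the fix to look something like the
sketch below, assuming getPathLength recurses via listStatus as the
stack trace suggests (I haven't checked this against the 0.9.1 source,
and the wrapper class is just there to make the snippet compile):

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.AccessControlException;

// Sketch of the proposed JobControlCompiler change: an unreadable
// directory contributes zero bytes to the input-size estimate instead
// of failing the whole job.
final class InputSizeEstimator {
    static long getPathLength(FileSystem fs, FileStatus status)
            throws IOException {
        if (!status.isDir()) {
            return status.getLen();
        }
        long size = 0;
        try {
            for (FileStatus child : fs.listStatus(status.getPath())) {
                size += getPathLength(fs, child);
            }
        } catch (AccessControlException e) {
            // Estimate only; data the user cannot read simply does
            // not count toward the reducer heuristic.
        }
        return size;
    }

    private InputSizeEstimator() {}
}

Logging a warning in that catch block might also be worthwhile, so that
skipped directories are at least visible in the job output.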
Thanks,
--Adam