Hi All,
I am trying to use the distributed cache in my UDF. I have the following file in
HDFS that I want available locally to all of my map tasks:
hadoop dfs -ls /scratch/
-rw-r--r-- 1 userid supergroup size date time /scratch/id_lookup
In my Pig script I pass its path as a parameter to the UDF:
ProcessedUI = FOREACH A GENERATE myparser.myUDF(param1, param2,
'/scratch/id_lookup');
In my UDF, inside the exec function, I do the following:
lookup_file = (String)input.get(2);
I have implemented getCacheFiles as follows:
public List<String> getCacheFiles() {
    List<String> list = new ArrayList<String>(1);
    list.add(lookup_file + "#id_lookup");
    return list;
}
Then I try to read that file through its symlink using standard Java I/O:
public void VectorizeData () {
    FileReader fr = new FileReader("./id_lookup");
    BufferedReader brd = new BufferedReader(fr);
}
I think I am not using it correctly (maybe the paths are wrong). I get the
following exception:
2013-12-11 11:09:50,821 [JobControl] ERROR org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:userid cause:java.io.FileNotFoundException: File does not exist: null
2013-12-11 11:09:51,291 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-12-11 11:09:51,301 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
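The "null" in the exception makes me suspect an ordering problem, though I have not confirmed it: if Pig calls getCacheFiles while setting up the job, before exec has run for any tuple, then lookup_file would still be unset at that point. The plain-Java sketch below only illustrates that suspicion. It does not use Pig at all; the class names CacheFileOrderSketch, PathFromExec and PathFromConstructor are made up for this example, and the second class shows a possible alternative where the path is given to the constructor instead of being read inside exec.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in classes; neither one extends Pig's EvalFunc.
public class CacheFileOrderSketch {

    // Mimics my current UDF: the path is only known after exec has run.
    static class PathFromExec {
        private String lookupFile; // stays null until exec is called

        void exec(String pathArg) {
            lookupFile = pathArg;
        }

        List<String> getCacheFiles() {
            List<String> list = new ArrayList<String>(1);
            // If this runs before exec, Java concatenates the text "null".
            list.add(lookupFile + "#id_lookup");
            return list;
        }
    }

    // Possible alternative: the path is fixed when the object is created.
    static class PathFromConstructor {
        private final String lookupFile;

        PathFromConstructor(String lookupFile) {
            this.lookupFile = lookupFile;
        }

        List<String> getCacheFiles() {
            List<String> list = new ArrayList<String>(1);
            list.add(lookupFile + "#id_lookup");
            return list;
        }
    }

    public static void main(String[] args) {
        // Assumed call order: getCacheFiles first, exec never reached yet.
        System.out.println(new PathFromExec().getCacheFiles().get(0));
        System.out.println(
            new PathFromConstructor("/scratch/id_lookup").getCacheFiles().get(0));
    }
}
```

If that guess is right, I assume the Pig-side equivalent would be to pass the path as a constructor argument through a DEFINE statement rather than as the third argument to the UDF call, but I have not tried that yet.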
Any help on this would be great!