Apache Pig UDF and Distributed cache

Sameer Tilak Wed, 11 Dec 2013 11:13:57 -0800

Hi All,
I am trying to use the distributed cache in my UDF. I have the following file in 
HDFS that I want all my map functions to have available locally:

hadoop dfs -ls /scratch/
-rw-r--r--   1 userid supergroup    size date time /scratch/id_lookup

In my Pig script I pass it as a parameter:


ProcessedUI = FOREACH A GENERATE myparser.myUDF(param1, param2, '/scratch/id_lookup');

In my UDF, inside the exec function, I do the following:

lookup_file = (String)input.get(2);

I have implemented getCacheFiles as follows:

public List<String> getCacheFiles() {
    List<String> list = new ArrayList<String>(1);
    list.add(lookup_file + "#id_lookup");
    return list;
}

Now I try to read that file using standard I/O methods:

public void VectorizeData() {
    FileReader fr = new FileReader("./id_lookup");
    BufferedReader brd = new BufferedReader(fr);
}

I think I am not using it correctly (maybe the paths are messed up, etc.). I get the 
following exception:

2013-12-11 11:09:50,821 [JobControl] ERROR org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:userid cause:java.io.FileNotFoundException: File does not exist: null
2013-12-11 11:09:51,291 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-12-11 11:09:51,301 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.

Any help on this would be great!
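
One possible reading of the "File does not exist: null" message, offered as a guess rather than a confirmed diagnosis: if Pig asks the UDF for getCacheFiles() while it is setting up the job, before exec has seen any tuple, then lookup_file is still null at that point and the cache entry becomes "null#id_lookup". The sketch below reproduces only that ordering in plain Java, with no Pig classes involved. The class names CacheOrderSketch, PathFromExec and PathFromConstructor are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the UDF; these do NOT extend Pig's EvalFunc.
class CacheOrderSketch {

    // Mirrors the code in the post: the path is only known once exec() runs.
    static class PathFromExec {
        private String lookupFile; // stays null until exec() is called

        void exec(String pathArgument) {
            lookupFile = pathArgument;
        }

        List<String> getCacheFiles() {
            List<String> list = new ArrayList<String>(1);
            // String concatenation turns a null reference into the text "null".
            list.add(lookupFile + "#id_lookup");
            return list;
        }
    }

    // Assumed alternative: the path is supplied when the object is built,
    // so it is already set no matter when getCacheFiles() is called.
    static class PathFromConstructor {
        private final String lookupFile;

        PathFromConstructor(String lookupFile) {
            this.lookupFile = lookupFile;
        }

        List<String> getCacheFiles() {
            List<String> list = new ArrayList<String>(1);
            list.add(lookupFile + "#id_lookup");
            return list;
        }
    }

    public static void main(String[] args) {
        // getCacheFiles() is called before exec(), as assumed above.
        System.out.println(new PathFromExec().getCacheFiles());
        // prints [null#id_lookup]
        System.out.println(new PathFromConstructor("/scratch/id_lookup").getCacheFiles());
        // prints [/scratch/id_lookup#id_lookup]
    }
}
```

If that assumption about the call order holds, passing the path as a constructor argument in the script, for example DEFINE myUDF myparser.myUDF('/scratch/id_lookup');, instead of as a third exec argument, might avoid the null path.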
