You could either distribute the small file using the distributed cache, in which case you can use the local file API to load its contents, or use the HDFS APIs directly from each task ... usually the distributed cache works better, but YMMV!
Regards,
Mridul

On Friday 15 April 2011 09:10 AM, Aniket Mokashi wrote:
Thanks Mridul,

(Although, small might grow bigger.) For instance, let's say small is in-memory-small and stored in a local file. When does my UDF load the data from the file? Earlier, I wrote a bag loader that returns a bag of the small data (e.g., load 'smalldata' using BagLoader() as (smallbag)). But then I had to write CONTAINSBAG(hdata, smallbag) to make this work. I think your solution would solve my problem, but how do I make my UDF read the file? Can you give me some pointers?

Thanks,
Aniket

On Thu, April 14, 2011 11:29 pm, Mridul Muralidharan wrote:

The way you described it, it does look like an application of cross. How 'small' is small? If it is pretty small, you can avoid the shuffle/reduce phase and directly stream huge through a UDF which does a task-local cross with 'small' (assuming it fits in memory).

%define my_udf MYUDF('smalldata')
huge = load 'mydata' as (hkey:chararray, hdata:chararray);
filtered = FILTER huge BY my_udf(hkey, hdata);

where my_udf returns true if there exists some skey in smalldata for which F(hdata, skey) is true, as you defined.

Regards,
Mridul

On Friday 15 April 2011 08:51 AM, Aniket Mokashi wrote:

Hi,

What would be the best way to write this script? I have two datasets: huge (hkey, hdata) and small (skey). I want to filter all the data from the huge dataset for which F(hdata, skey) is true. Please advise. For example,

huge = load 'mydata' as (key:chararray, value:chararray);
small = load 'smalldata' as (skey:chararray);
h_s_cross = cross huge, small;
filtered = filter h_s_cross by CONTAINS(value, skey);

Thanks,
Aniket
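To make the task-local cross concrete, here is a sketch of the logic MYUDF would apply per record (plain Python; the sample data, the substring-containment F, and the name `my_udf` are illustrative assumptions, not the actual UDF):

```python
# The small relation, as the UDF would hold it in memory after
# loading 'smalldata' once per task.
small = {"apple", "banana"}


def my_udf(hkey, hdata, small_set):
    """True if F(hdata, skey) holds for some skey in the small set.

    Here F is substring containment, mirroring CONTAINS in the thread.
    """
    return any(skey in hdata for skey in small_set)


# Streaming 'huge' through the UDF replaces the cross + filter:
# each record is kept or dropped locally, with no shuffle of the
# (huge x small) product.
huge = [("k1", "has apple inside"), ("k2", "nothing relevant")]
filtered = [(k, v) for (k, v) in huge if my_udf(k, v, small)]
print(filtered)  # [('k1', 'has apple inside')]
```

This is exactly the saving Mridul describes: the cross product never materializes, so the only per-record cost is an in-memory scan of the small set.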
