You could either distribute the small file using the distributed cache - in which case you can use the local file API to load content from the file - or use the HDFS API directly from each task. Usually the distributed cache works better, but YMMV!
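To make the "load from a local file" part concrete, here is a minimal plain-Java sketch of that logic (class and method names are hypothetical, not from Pig). In a real Pig UDF this class would extend EvalFunc<Boolean>, take 'smalldata' as a constructor argument, and the file would arrive on each task node via the distributed cache; the loading and matching logic would look much the same:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch: load the small key set once, then test each
// huge record against it (the task-local "cross" from the thread).
class ContainsAnySmallKey {
    private final Set<String> smallKeys = new HashSet<>();

    ContainsAnySmallKey(String smallFile) throws IOException {
        // Read the small data set into memory once, at construction time.
        for (String line : Files.readAllLines(Paths.get(smallFile))) {
            String key = line.trim();
            if (!key.isEmpty()) smallKeys.add(key);
        }
    }

    // Returns true if there exists some skey for which F(hdata, skey)
    // holds - here F is simple substring containment, as in CONTAINS.
    boolean exec(String hdata) {
        for (String skey : smallKeys) {
            if (hdata.contains(skey)) return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Demo: write a tiny 'smalldata' file, then test two records.
        Path f = Files.createTempFile("smalldata", ".txt");
        Files.write(f, Arrays.asList("foo", "bar"));
        ContainsAnySmallKey udf = new ContainsAnySmallKey(f.toString());
        System.out.println(udf.exec("xxfooyy")); // true
        System.out.println(udf.exec("zzz"));     // false
    }
}
```

Because the set is loaded once per task rather than once per record, the per-record cost is just the membership scan.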


Regards,
Mridul

On Friday 15 April 2011 09:10 AM, Aniket Mokashi wrote:
Thanks Mridul,

(Although small might grow bigger.) For instance, let's say small is
in-memory-small and stored in a local file.

When does my udf load the data from the file? Earlier, I wrote a bag
loader that returns a bag of the small data (e.g. load 'smalldata' using
BagLoader() as (smallbag)). But then I had to write CONTAINSBAG(hdata,
smallbag) to make this work.

I think your solution would solve my problem, but how do I make my udf
read the file? Can you give me some pointers?

Thanks,
Aniket


On Thu, April 14, 2011 11:29 pm, Mridul Muralidharan wrote:


The way you described it, it does look like an application of cross.


How 'small' is small?
If it is pretty small, you can avoid the shuffle/reduce phase and
stream huge directly through a udf that does a task-local cross with
'small' (assuming it fits in memory).



DEFINE my_udf MYUDF('smalldata');

huge = load 'mydata' as (hkey:chararray, hdata:chararray);
filtered = FILTER huge BY my_udf(hkey, hdata);




Where my_udf returns true if there exists some skey in smalldata for
which F(hdata, skey) is true - as you defined.


Regards,
Mridul


On Friday 15 April 2011 08:51 AM, Aniket Mokashi wrote:

Hi,


What would be the best way to write this script?
I have two datasets - huge (hkey, hdata), small(skey). I want to filter
all the data from huge dataset for which F(hdata, skey) is true. Please
advise.

For example,
huge = load 'mydata' as (hkey:chararray, hdata:chararray);
small = load 'smalldata' as (skey:chararray);
h_s_cross = cross huge, small;
filtered = filter h_s_cross by CONTAINS(hdata, skey);
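For intuition, what this cross-then-filter computes can be sketched in plain Java (names hypothetical) as a nested loop - which also shows the cost: with N huge records and M small keys, the cross materializes N*M pairs before any filtering happens:

```java
import java.util.*;

// Hypothetical sketch of cross + filter: form every (huge, small) pair,
// keep the huge records whose hdata contains some skey.
class CrossFilterSketch {
    // Each huge record is a {hkey, hdata} pair.
    static List<String[]> crossFilter(List<String[]> huge, List<String> small) {
        List<String[]> out = new ArrayList<>();
        for (String[] record : huge) {           // outer loop over huge
            for (String skey : small) {          // full cross product
                if (record[1].contains(skey)) {  // CONTAINS(hdata, skey)
                    out.add(record);
                    break;                       // keep each record once
                }
            }
        }
        return out;
    }
}
```

This is why, when small fits in memory, pushing the inner loop into a udf on the map side (as suggested above) avoids shipping the full cross product through a shuffle.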


Thanks,
Aniket

