Re: Question about bags and UDFs

Alan Gates Thu, 21 Apr 2011 13:17:16 -0700

Starting with Pig 0.9 (not yet released, but you can build it off the branch), a UDF can specify a file to put in the distributed cache. You could thus have your UDF pick up the file locally on your box and put it in the distributed cache, and then read it from the distributed cache on the back end. If running with an unreleased version isn't an option for you, you could manually load the file into the distributed cache and then read it from your UDF.


Alan.


On Apr 21, 2011, at 8:18 AM, Mark Laczin wrote:

Does anyone know how to ship the config file in this situation?
I'm encountering problems with file not found exceptions when trying to run
this over a cluster.
On Wed, Apr 20, 2011 at 1:03 PM, Mark Laczin <[email protected]> wrote:
I kind of solved it by reading in the data from my UDF constructor (it's just a file with a list of like 10 regular expressions, so I did manual file I/O), by passing the path (provided as a parameter), and then just storing it (and then looping over it and testing a, b by hand). It's not the MapReduce way, but it will work for this application, considering the small size of the file.
If anyone knows how my "patch" might fail, or if there is a better way -
feel free to speak up.

-Mark
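
For illustration, the file-loading step described above might look roughly like the sketch below. It is plain Java with no Pig dependency; the class name PatternFile, the method load, and the one-regex-per-line file layout are all assumptions made for the example, not anything taken from the thread. The path has to be readable on whichever machine runs this code, so a path that only exists on the submitting box would fail elsewhere.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Pig): reads one regular expression per line.
class PatternFile {
    // Assumes a UTF-8 text file; blank lines are skipped.
    static List<Pattern> load(String path) throws IOException {
        List<Pattern> patterns = new ArrayList<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    // Compile once here so the patterns can be reused for every record.
                    patterns.add(Pattern.compile(trimmed));
                }
            }
        }
        return patterns;
    }
}
```

A UDF constructor could call PatternFile.load(path) once and keep the returned list in a field, so the file is read once per UDF instance rather than once per record.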
On Wed, Apr 20, 2011 at 12:51 PM, Bill Graham <[email protected]> wrote:
You could try doing GROUP ALL on the contents of M, which would
produce a single bag containing each record, and then joining M with
data using a surrogate constant key. Or CROSS would also work instead
of the join, I suspect. Then you'd have a tuple like this to work with:
(a, b, M:bag)

I'm not sure if things would blow up if M is too large to fit into
memory in your UDF though.


On Wed, Apr 20, 2011 at 6:27 AM, Mark Laczin <[email protected]> wrote:
I'm trying to do something like this:
(if 'data' is a set of tuples loaded from a file containing fields a, b and c)
(if 'M' is another set of tuples loaded from a file)

data = FOREACH data GENERATE *, someUDF(a, b, M);
What I'm looking for is to generate a value (in this case, a string) based on a
and b, using the contents of M inside the UDF.

The UDF looks like this, in pseudocode:

foreach element x in M {
  if a matches x or b matches x {
    return "something"
  }
}
return "something else"
Is this possible? I keep getting errors related to "Scalars can only be
used with projections" and the like.
The thing holding me back from using filters is that I won't know what's in
M until it's read, and since (in this case) they'll be regular expressions,
I'd need to be able to join/group with regex matching, which I don't think
Pig can do.

-Mark
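
As a rough illustration of the pseudocode in the question, here is a minimal plain-Java sketch of the matching loop. The class name RegexClassifier and the method classify are hypothetical, and M is assumed to be already available as a list of compiled patterns; in a real Pig UDF this logic would sit inside EvalFunc.exec(). The sketch also assumes "matches" means the whole field must match the expression, and treats a null field as a non-match.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone class; it does not depend on Pig.
class RegexClassifier {
    // Returns "something" as soon as any pattern in m matches a or b in full.
    static String classify(String a, String b, List<Pattern> m) {
        for (Pattern x : m) {
            if (fullyMatches(x, a) || fullyMatches(x, b)) {
                return "something";
            }
        }
        return "something else";
    }

    // Null fields never match; matches() requires the entire string to match.
    private static boolean fullyMatches(Pattern x, String s) {
        return s != null && x.matcher(s).matches();
    }
}
```

If a substring match is what is wanted instead, Matcher.find() could replace matches() in fullyMatches.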
