Starting with Pig 0.9 (not yet released but you can build it off the
branch) a UDF can specify a file to put in the distributed cache. You
could thus have your UDF pick up the file locally on your box and put
it in the distributed cache, and then read it from the distributed
cache on the back end. If running with an un-released version isn't
an option for you, you could manually load the file into the
distributed cache and then read it from your UDF.
Alan.
On Apr 21, 2011, at 8:18 AM, Mark Laczin wrote:
Does anyone know how to ship the config file in this situation? I'm
encountering problems with file not found exceptions when trying to run
this over a cluster.
On Wed, Apr 20, 2011 at 1:03 PM, Mark Laczin <[email protected]> wrote:
I kind of solved it by reading in the data from my UDF constructor (it's
just a file with a list of like 10 regular expressions, so I did manual
file I/O), by passing the path (provided as a parameter), and then just
storing it (and then, looping over it and testing a, b by hand). It's not
the MapReduce way, but it will work for this application, considering the
small size of the file.
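A minimal sketch of the constructor-loading idea described above, in plain Java with no Pig classes. The class name PatternFile and its methods are hypothetical, and the one-regex-per-line file layout is an assumption. Note that the path is resolved on the local filesystem of whichever machine runs the constructor, so it only works on a cluster if the file exists at that path on every task node.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not a Pig class): loads one regex per line at
// construction time and keeps the compiled patterns in memory.
final class PatternFile {
    private final List<Pattern> patterns = new ArrayList<>();

    // path is read from the local filesystem of the machine running this code.
    PatternFile(String path) {
        try (BufferedReader in =
                 Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {          // skip blank lines
                    patterns.add(Pattern.compile(line));
                }
            }
        } catch (IOException e) {
            // Surfaces the "file not found" case with the offending path.
            throw new UncheckedIOException("cannot read pattern file " + path, e);
        }
    }

    // Number of patterns loaded.
    int size() {
        return patterns.size();
    }

    // True if value matches any loaded pattern in full; null never matches.
    boolean matchesAny(String value) {
        if (value == null) {
            return false;
        }
        for (Pattern p : patterns) {
            if (p.matcher(value).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

Inside a UDF, an object like this would be built once in the constructor from the path argument and reused for every input tuple.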
If anyone knows how my "patch" might fail, or if there is a better way,
feel free to speak up.
-Mark
On Wed, Apr 20, 2011 at 12:51 PM, Bill Graham <[email protected]> wrote:
You could try doing GROUP ALL on the contents of M, which would produce a
single bag containing each record, and then joining M with data using a
surrogate constant key. Or CROSS would also work instead of the join, I
suspect. Then you'd have a tuple like this to work with:
(a, b, M:bag)
I'm not sure if things would blow up if M is too large to fit into
memory in your UDF though.
On Wed, Apr 20, 2011 at 6:27 AM, Mark Laczin <[email protected]> wrote:
I'm trying to do something like this:
(if 'data' is a set of tuples loaded from a file containing fields a, b
and c)
(if 'M' is another set of tuples loaded from a file)
data = FOREACH data GENERATE *, someUDF(a, b, M);
What I'm looking for is to generate (in this case, a string) based on a
and b, using the contents of M inside the UDF.
The UDF looks like this, in pseudocode:
    foreach element x in M {
        if a matches x or b matches x {
            return "something"
        }
    }
    return "something else"
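The pseudocode above could be sketched in plain Java roughly as follows. The class name RegexListMatcher and the method classify are hypothetical, no Pig classes are used, and "matches" is assumed to mean a full-string match (swap in find() for substring matching).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not a Pig class) mirroring the pseudocode:
// M is the list of regular expressions, a and b are the two fields.
final class RegexListMatcher {
    private final List<Pattern> patterns = new ArrayList<>();

    RegexListMatcher(List<String> regexes) {
        // Compile once so each call only pays for matching.
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
    }

    // Returns "something" if a or b matches any pattern in full,
    // otherwise "something else".
    String classify(String a, String b) {
        for (Pattern p : patterns) {
            if (fullyMatches(p, a) || fullyMatches(p, b)) {
                return "something";
            }
        }
        return "something else";
    }

    // Null fields never match.
    private static boolean fullyMatches(Pattern p, String s) {
        return s != null && p.matcher(s).matches();
    }
}
```

The loop is linear in the size of M for every input row, which is fine for a handful of patterns.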
Is this possible? I keep getting errors related to "Scalars can only be
used with projections" and the like.
The thing holding me back from using filters is that I won't know what's
in M until it's read, and since (in this case) they'll be regular
expressions, I'd need to be able to join/group with regex matching which I
don't think Pig can do.
-Mark