Hello

I'm trying to find a SUM of a range of fields, and am having difficulty.

I have the following data structure (from the MovieLens public dataset),
where there's a "fixed" field of "Name" followed by a denormalized "genres" list
of 0/1 flags (for example, the first column is "action", the second is "comedy", etc.).

Name|Genres
Toy Story|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
GoldenEye|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

This seems like an ideal use for the project-range feature of Pig,
where it would be trivial to find movies that belonged to two or more genres.

I'm trying to use this code:
movies = load 'movies' USING PigStorage('|');
movie_and_genres = FOREACH movies GENERATE $0, TOBAG($2 ..) AS genres;
DUMP movie_and_genres;
This works, and gives me:
(Toy Story,{(0),(1),(1),(1),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)})
(GoldenEye,{(1),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(1),(0),(0)})

However, if I try to run a SUM on the genres bag, I receive the error:

"Could not infer the matching function for org.apache.pig.builtin.SUM
as multiple or none of them fit. Please use an explicit cast."

I've tried to flatten and cast the genres bag like this:

movies = load 'movies' USING PigStorage('|');
movie_and_genres = FOREACH movies GENERATE $0, FLATTEN(TOBAG($2 ..));
movie_and_int_genres = FOREACH movie_and_genres GENERATE $0, (int) $1;

However, then I receive the error:
Cannot cast bytearray to int

Any ideas what to try next?  Or, would I be better off trying to use a STREAM
or custom loader to do something like this?
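
One thing I have not tried yet, so treat it as an untested sketch: declaring
an explicit schema on LOAD, so the genre columns are already typed as int
before they reach TOBAG. My guess is that both errors come from the fields
being untyped bytearrays. The field names g1 through g19 are placeholders I
made up, and I'm assuming the layout is exactly one name column plus the 19
genre flags shown in the sample above.

```pig
-- Assumption: one name column followed by 19 genre flags.
-- g1..g19 are placeholder names, not taken from the dataset.
movies = LOAD 'movies' USING PigStorage('|') AS (
    name:chararray,
    g1:int,  g2:int,  g3:int,  g4:int,  g5:int,
    g6:int,  g7:int,  g8:int,  g9:int,  g10:int,
    g11:int, g12:int, g13:int, g14:int, g15:int,
    g16:int, g17:int, g18:int, g19:int);

-- With typed columns, TOBAG should yield a bag of ints,
-- which gives SUM a concrete type to match against.
movie_and_genres = FOREACH movies GENERATE name, TOBAG(g1 ..) AS genres;

-- Each flag is 0 or 1, so the sum is the number of genres per movie.
genre_counts = FOREACH movie_and_genres GENERATE name, SUM(genres) AS num_genres;

-- Keep only the movies that belong to two or more genres.
multi_genre = FILTER genre_counts BY num_genres >= 2;
DUMP multi_genre;
```

If that is right, the project-range still does the work and no cast is
needed after the bag is built.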

Thanks,
--Nate
