Hello everybody,

I'm very new to Pig and Hadoop, and I'm trying to run some efficiency tests across different technologies. For that I'm using a movie database found here:

http://www.grouplens.org/taxonomy/term/14

It's a database of 10M entries, weighing about 250 MB. I know that is certainly not enough to see the real power of Map/Reduce, but I thought it could give an idea.

So the data is formatted as follows:

movies.dat

    movieid::title::genres

ratings.dat

    userid::movieid::rating::timestamp

1. MySQL

My first test on MySQL was to join the two tables on movieid, group them on "genres", compute the AVG rating, and get the top 20, like so:

SELECT genres,  AVG( a.ratings )
FROM ratings a
JOIN movies b ON a.movieid = b.movieid
GROUP BY genres
ORDER BY AVG( a.ratings ) DESC
LIMIT 20

It took about two minutes to get the results.
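For anyone who wants to replay the query without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 module; the tiny in-memory tables are my own invented sample data, and the table/column names simply mirror the query above:

```python
import sqlite3

# Hypothetical in-memory rerun of the MySQL query above.
# The sample rows are made up; only the query shape matters.
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE movies  (movieid INTEGER, title TEXT, genres TEXT);
CREATE TABLE ratings (userid INTEGER, movieid INTEGER, ratings REAL, timestamp TEXT);
INSERT INTO movies  VALUES (1, 'A', 'Comedy'), (2, 'B', 'Drama');
INSERT INTO ratings VALUES (10, 1, 3.0, '0'), (11, 1, 5.0, '0'), (12, 2, 4.0, '0');
""")

rows = conn.execute("""
SELECT genres, AVG(a.ratings)
FROM ratings a
JOIN movies b ON a.movieid = b.movieid
GROUP BY genres
ORDER BY AVG(a.ratings) DESC
LIMIT 20
""").fetchall()
print(rows)
```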

2. Python/Pandas

I've done the same with Python and the Pandas library, and it took about 4 seconds to dump the results.
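For reference, a minimal sketch of what that Pandas pipeline might look like; the in-memory frames below are invented stand-ins (in practice they would come from `pd.read_csv(..., sep='::', engine='python')` on the .dat files):

```python
import pandas as pd

# Hypothetical stand-ins for the real .dat files:
movies = pd.DataFrame({'movieid': [1, 2],
                       'title':   ['A', 'B'],
                       'genres':  ['Comedy', 'Drama']})
ratings = pd.DataFrame({'userid':  [10, 11, 12],
                        'movieid': [1, 1, 2],
                        'rating':  [3.0, 5.0, 4.0]})

# Join on movieid, average the ratings per genre, keep the top 20.
joined = ratings.merge(movies, on='movieid')
top20 = (joined.groupby('genres')['rating']
               .mean()
               .sort_values(ascending=False)
               .head(20))
print(top20)
```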

3. Hadoop/Pig

I've run an Ubuntu 12.04 64-bit server release on VMware Player, installed Hadoop as a pseudo-distributed single-node cluster, and installed Pig.

Here is my complete code to get the same results:

I had to clean the files first, since Pig's default loader only accepts a single-character separator and cannot read the two-character '::' directly, like so:

-- Load each raw line as a single chararray, split it on the
-- two-character '::' separator, and store the result
-- (tab-separated by default) so it can be reloaded with a schema.
movies = LOAD 'movie/movies.dat' AS (line:chararray);
moviesclean = FOREACH movies GENERATE FLATTEN(STRSPLIT(line, '::'));
STORE moviesclean INTO 'movies';

ratings = LOAD 'movie/ratings.dat' AS (line:chararray);
ratingsclean = FOREACH ratings GENERATE FLATTEN(STRSPLIT(line, '::'));
STORE ratingsclean INTO 'ratings';
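
This cleaning pass could equally be done outside Pig before uploading to HDFS; a hypothetical standalone sketch (file paths assumed, not from the original post) that rewrites the '::'-delimited files as tab-separated ones:

```python
# Hypothetical alternative to the Pig cleaning step: turn the
# two-character '::' separator into a tab, which Pig's default
# single-character loader can then read directly.
def clean_line(line):
    """Replace every '::' separator in one record with a tab."""
    return line.rstrip('\n').replace('::', '\t')

def clean_file(src, dst):
    """Rewrite src (::-delimited) as dst (tab-delimited)."""
    with open(src) as fin, open(dst, 'w') as fout:
        for line in fin:
            fout.write(clean_line(line) + '\n')

print(clean_line('1::Toy Story (1995)::Animation|Comedy'))
```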


-- Reload the cleaned, tab-separated files with a proper schema.
movies = LOAD 'movies' AS (movieid:int, title:chararray, genres:chararray);
ratings = LOAD 'ratings' AS (userid:int, movieid:int, rating:float, timestamp:chararray);

-- Join on movieid, group by genre, and average the ratings.
X = JOIN ratings BY movieid, movies BY movieid;
Y = GROUP X BY genres;
Z = FOREACH Y GENERATE group, AVG(X.rating);
STORE Z INTO 'results';

-- It took about 10 minutes to store the Z file

Z = LOAD 'results' AS (genres:chararray, avgrating:float);
Zsorted = ORDER Z BY avgrating DESC;
results = LIMIT Zsorted 20;
DUMP results;

-- And here about 4 minutes

So in total it takes about 15 minutes to do the same analysis. Do you know if this is normal?
Is it because Hadoop's power is only revealed with terabytes of data?

Thanks

Regards.

Thomas
