Hi,
Of all the requests I run using Pig 0.8.1, the heaviest one is the
following:
/* load session data from HBase */
start_sessions = load ...; /* start of sessions */
end_sessions = load ...; /* end of sessions */
locations = load ...; /* session locations */
infos = load ...; /* session info */
/* join start and end of session */
sessions = JOIN start_sessions BY sid, end_sessions BY sid;
/* remove invalid or too long sessions */
sessions = FILTER sessions BY end > start AND end - start <
MAX_SESSION_DURATION;
/* Join session table with info table */
sessions = JOIN sessions BY infoid, infos BY infoid;
/* Join session table with location table */
sessions = JOIN sessions BY locid LEFT, locations BY locid;
/* Keep only required fields and format */
sessions = FOREACH sessions GENERATE ...; /* fields I want to keep
and need to format */
/* store sessions in an HDFS file */
store sessions into ...;
I need to optimize it and would like your advice. Here is what I
have tried and verified so far:
1- This request builds a plan of 3 levels.
2- I've tried to use a 'merge' join for the first JOIN (since
start_sessions and end_sessions are both indexed by sid).
Unfortunately, the HBaseLoader() doesn't support merge joins.
3- I've noticed that the last M/R job is not correctly balanced: it
spawns 3 reduce tasks, but only 1 of them actually processes any
data. The locations table is actually empty in this case (does this
explain the badly balanced reduce tasks?).
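For reference, the merge join mentioned in point 2 is written roughly
as below. This is only a sketch that reuses the aliases from the
script above, not the exact statement from the job; it assumes both
inputs are already sorted on sid, which a merge join requires.
/* sketch: merge join of the two sid-sorted inputs */
sessions = JOIN start_sessions BY sid, end_sessions BY sid USING 'merge';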
Any ideas?