Hi,

Of the requests I run using Pig 0.8.1, the heaviest one is the following:

   /* load session data from HBase */
   start_sessions = load ... (start of sessions)
   end_sessions = load ... (end of sessions)
   location = load ... (session location)
   info = load ... (session info)

   /* join start and end of session */
   sessions = JOIN start_sessions BY sid, end_sessions BY sid;

   /* remove invalid or too long sessions */
   sessions = FILTER sessions BY end > start AND end - start < MAX_SESSION_DURATION;

   /* Join session table with info table */
   sessions = JOIN sessions BY infoid, info BY infoid;

   /* Join session table with location table */
   sessions = JOIN sessions BY locid LEFT, location BY locid;

   /* Keep only required fields and format */
   sessions = FOREACH sessions GENERATE ... fields I want to keep and need to format ...;

   /* store sessions in an HDFS file */
   store sessions into ...;

I need to optimize it and would like your advice. Here is what I have tried and verified so far:

1- This request builds a plan with 3 levels.
2- I've tried to use a 'merge' join for the first JOIN (since start_sessions and end_sessions are indexed by sid). Unfortunately, HBaseLoader() doesn't support merge joins.
3- I've noticed that the last M/R job is not correctly balanced: it spawns 3 reduce tasks, but only 1 actually processes any data. The location table is empty in this case (does this explain the badly balanced reduce tasks?).
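
For reference, the merge join attempt in item 2 was along these lines. This is only a sketch using the same aliases as the script above, with the load statements elided as before; it relies on both inputs being ordered by sid:

   /* attempted merge join of start and end of session */
   sessions = JOIN start_sessions BY sid, end_sessions BY sid USING 'merge';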

Any ideas?








