We should add merge join support to HBaseStorage; it should be able to do that for joins on the table key.

Are your locids skewed? Have you tried using a 'skewed' join for the last job? Actually, if locations is small, you can even use a replicated join.

Any particular reason to store and load the starts and ends of sessions separately? Seems like something you could put into a single HBase table row, or at least a single HBase table, and derive the starts and ends via grouping on user ids.

D

On Tue, Aug 23, 2011 at 9:27 AM, Vincent Barat <[email protected]> wrote:
> Hi,
>
> Of all the requests I run using Pig 0.8.1, the heaviest one is the
> following:
>
> /* load session data from HBase */
> start_sessions = load ... (start of sessions)
> end_sessions = load ... (end of sessions)
> location = load ... (session location)
> info = load ... (session info)
>
> /* join start and end of session */
> sessions = JOIN start_sessions BY sid, end_sessions BY sid
>
> /* remove invalid or too long sessions */
> sessions = FILTER sessions BY end > start AND end - start < MAX_SESSION_DURATION
>
> /* Join session table with info table */
> sessions = JOIN sessions BY infoid, infos BY infoid;
>
> /* Join session table with location table */
> sessions = JOIN sessions BY locid LEFT, locations BY locid;
>
> /* Keep only required fields and format */
> sessions = FOREACH sessions GENERATE ... fields I want to keep and need to format...;
>
> /* store sessions in an HDFS file */
> store session;
>
> I need to optimize it, and would like your advice. Here is what I have
> tried and verified.
>
> 1- This request builds a plan of 3 levels.
> 2- I've tried to use a 'merge' join for the first JOIN (since
> start_sessions and end_sessions are indexed by sid). Unfortunately,
> HBaseLoader() doesn't support merge JOIN.
> 3- I've noticed that the last M/R job is not correctly balanced: it spawns
> 3 reduce tasks, but only 1 actually processes any data. The location table
> is actually empty in this case (does this explain the badly balanced reduce
> tasks?).
>
> Any idea?
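
For reference, the two join variants suggested at the top of this reply would look roughly like the lines below. This is only a sketch: the aliases and keys (sessions, locations, locid) are taken from the quoted script, the result names sessions_skewed and sessions_repl are made up for illustration, and it assumes the Pig version in use accepts LEFT together with these join types.

```pig
-- Sketch only: aliases follow the quoted script; result names are hypothetical.

-- Option 1: skewed join, for when a few locid values carry most of the rows.
sessions_skewed = JOIN sessions BY locid LEFT, locations BY locid USING 'skewed';

-- Option 2: replicated (map-side) join. The last relation listed (locations)
-- is held in memory, so it has to be small.
sessions_repl = JOIN sessions BY locid LEFT, locations BY locid USING 'replicated';
```

Only one of the two would replace the plain JOIN on locid; the replicated form avoids the reduce phase for that join entirely, which is why it is worth trying first when locations is small or empty.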
