On 23/08/11 20:28, Dmitriy Ryaboy wrote:
We should add merge join support to HBaseStorage; it should be able to do
that for joins on the table key.
That would be great!
Are your locids skewed? Have you tried using 'skewed' join for the last job?
Actually, if locations are small, you can even use replicated.
Unfortunately not (our locids are MD5 hash codes).
Any particular reason to store and load starts and ends of sessions
separately? Seems like something you could put into a single HBase table
row, or at least a single HBase table, and derive the starts and ends via
grouping on user ids.
Historical reasons only. Yes, I'm thinking about how to change this.
Actually, locations and infos are small enough to fit into memory, so
I've used replicated joins and it helped a lot (4x faster in our case).
So using a merge join for the first JOIN would definitely solve my
issue.
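
For readers following along, the replicated joins described above could look roughly like the sketch below. The LOAD statements, paths, and field names (label, name) are hypothetical stand-ins for the HBase loads in the original script; only the USING 'replicated' clauses illustrate the technique discussed in this thread.

```pig
-- Hypothetical inputs standing in for the HBase loads of the original script.
sessions  = LOAD 'sessions'  AS (sid:chararray, infoid:chararray, locid:chararray);
infos     = LOAD 'infos'     AS (infoid:chararray, label:chararray);
locations = LOAD 'locations' AS (locid:chararray, name:chararray);

-- Replicated (fragment-replicate) join: the relation listed last is held in
-- memory on every map task, so it must be small.
with_info = JOIN sessions BY infoid, infos BY infoid USING 'replicated';

-- The left outer join against locations can be replicated the same way.
with_loc = JOIN with_info BY sessions::locid LEFT, locations BY locid USING 'replicated';
```

Since the small relation is shipped to every map task, this only works while infos and locations keep fitting in memory.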
Thanks for your help.
D
On Tue, Aug 23, 2011 at 9:27 AM, Vincent Barat <[email protected]> wrote:
Hi,
Of all the requests I run using Pig 0.8.1, the heaviest one is the
following:
/* load session data from HBase */
start_sessions = load ... (start of sessions)
end_sessions = load ... (end of sessions)
locations = load ... (session location)
infos = load ... (session info)
/* join start and end of session */
sessions = JOIN start_sessions BY sid, end_sessions BY sid;
/* remove invalid or too long sessions */
sessions = FILTER sessions BY end > start AND end - start <
MAX_SESSION_DURATION;
/* Join session table with info table */
sessions = JOIN sessions BY infoid, infos BY infoid;
/* Join session table with location table */
sessions = JOIN sessions BY locid LEFT, locations BY locid;
/* Keep only required fields and format */
sessions = FOREACH sessions GENERATE ... fields I want to keep and need
to format...;
/* store sessions in an HDFS file */
store sessions;
I need to optimize it and would like your advice. Here is what I have
tried and verified:
1- This request builds a plan of 3 levels.
2- I've tried to use a 'merge' join for the first JOIN (since
start_sessions and end_sessions are indexed by sid). Unfortunately,
HBaseLoader() doesn't support merge joins.
3- I've noticed that the last M/R job is not correctly balanced: it spawns
3 reduce tasks, but only 1 actually processes any data. The location table
is actually empty in this case (does this explain the badly balanced reduce
tasks?).
Any ideas?
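
For reference, the join variants mentioned in points 2 and 3 are requested with a USING clause. The sketch below shows the syntax only: the LOAD statements, paths, and field names (start_ts, end_ts, name) are hypothetical, and as noted above the merge variant is not accepted when the inputs come from the HBase loader.

```pig
-- Hypothetical inputs, assumed to be already sorted by sid.
start_sessions = LOAD 'start_sessions' AS (sid:chararray, start_ts:long, locid:chararray);
end_sessions   = LOAD 'end_sessions'   AS (sid:chararray, end_ts:long);
locations      = LOAD 'locations'      AS (locid:chararray, name:chararray);

-- Merge join: both inputs must be sorted on the join key, and the loaders
-- must support it.
joined = JOIN start_sessions BY sid, end_sessions BY sid USING 'merge';

-- Skewed join: an option when a few join keys carry most of the rows.
with_loc = JOIN joined BY start_sessions::locid LEFT, locations BY locid USING 'skewed';
```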