Hi all, I have a very simple M/r scenario -join subscribers records (50 M records/20TB) with subscriber events (1 B records/5TB). The goal is to update the subscribers records with incoming events.
Few possible solutions: 1. Reduce side join. In map, omit subscriber id as key. Reduce will get the subscriber record + events and application code can update the record using the events. 2. Arrange the records/events in smaller files (by subscriber modulo) invoke multiple M/R jobs on each per. e.g. records_m_1*events_m_1, records_m_2*events_m_2. Same logic like in 1, but now working on smaller files with better correlation. 3. Use Composite Join pattern, which means that join will be in Map side saving additional Read/Write operation. On the other hand it means that events files should be sorted (so that another M/R job). I'd appreciate the forum feedback on the 3 alternatives. Thanks, Lior
