I am developing a job whose input path contains 30B records (File A). I need to filter these records using another file that can have anywhere from 30K to 180M records (File B). So for each record in File A, I make a lookup in File B. I am using the distributed cache to share File B. The problem is that when File B is large (for example 180M records), I spend too much CPU time loading it into a HashMap, and I pay that cost in every map task.
In Hadoop 2.x, JVM reuse was discontinued, so I am thinking of using MultithreadedMapper, making the HashMap thread-safe and sharing this read-only structure across the mapper threads. Is this a good approach?
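Here is a minimal sketch of what I have in mind. It assumes File B is tab-delimited, is exposed in the task's working directory under the symlink name fileB.txt, that the join key is the first field of each record in File A, and that 8 threads per task is reasonable; all of those details are placeholders for my real job. The map is built lazily by the first thread that reaches setup() and then read by all threads without locking:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class FilterJob {

    public static class FilterMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        // Built once per JVM, then shared read-only by all mapper threads.
        private static volatile Map<String, String> lookup;

        @Override
        protected void setup(Context context) throws IOException {
            // MultithreadedMapper calls setup() once per thread,
            // so only the first thread should build the map.
            if (lookup == null) {
                synchronized (FilterMapper.class) {
                    if (lookup == null) {
                        Map<String, String> m = new HashMap<>();
                        // "fileB.txt" is the symlink name of the
                        // distributed-cache file (placeholder).
                        try (BufferedReader r =
                                new BufferedReader(new FileReader("fileB.txt"))) {
                            String line;
                            while ((line = r.readLine()) != null) {
                                String[] parts = line.split("\t", 2);
                                m.put(parts[0], parts.length > 1 ? parts[1] : "");
                            }
                        }
                        lookup = m;
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: the join key is the first field.
            String joinKey = value.toString().split("\t", 2)[0];
            if (lookup.containsKey(joinKey)) {
                context.write(value, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(FilterJob.class);

        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, FilterMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8);

        job.setNumReduceTasks(0);
        // Input/output paths and adding File B to the distributed cache omitted.
    }
}
```

With this layout the HashMap is still built once per task JVM, but at least all the threads of that task share it instead of each map task paying the full load cost.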
