I am developing a job whose input path contains 30B records (File A). I need to filter these records using another file that can have anywhere from 30K to 180M records (File B). So for each record in File A, I make a lookup in File B. I am using the distributed cache to share File B. The problem is that when File B is large (for example 180M records), I spend too much CPU time loading it into a HashMap, and I pay that cost in every map task.
In Hadoop 2.x, JVM reuse was discontinued, so I am thinking of using MultithreadedMapper, making the HashMap thread-safe and sharing this read-only structure across the mapper threads. Is this a good approach?
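Here is a minimal sketch of what I have in mind. It assumes File B is tab-delimited, is exposed in the task's working directory under the symlink name fileB.txt, that the join key is the first field of each record in File A, and that 8 threads per task is reasonable; all of those details are placeholders for my real job. The map is built lazily by the first thread that reaches setup() and then read by all threads without locking:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class FilterJob {

    public static class FilterMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        // Built once per JVM, then shared read-only by all mapper threads.
        private static volatile Map<String, String> lookup;

        @Override
        protected void setup(Context context) throws IOException {
            // MultithreadedMapper calls setup() once per thread,
            // so only the first thread should build the map.
            if (lookup == null) {
                synchronized (FilterMapper.class) {
                    if (lookup == null) {
                        Map<String, String> m = new HashMap<>();
                        // "fileB.txt" is the symlink name of the
                        // distributed-cache file (placeholder).
                        try (BufferedReader r =
                                new BufferedReader(new FileReader("fileB.txt"))) {
                            String line;
                            while ((line = r.readLine()) != null) {
                                String[] parts = line.split("\t", 2);
                                m.put(parts[0], parts.length > 1 ? parts[1] : "");
                            }
                        }
                        lookup = m;
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: the join key is the first field.
            String joinKey = value.toString().split("\t", 2)[0];
            if (lookup.containsKey(joinKey)) {
                context.write(value, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(FilterJob.class);

        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, FilterMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8);

        job.setNumReduceTasks(0);
        // Input/output paths and adding File B to the distributed cache omitted.
    }
}
```

With this layout the HashMap is still built once per task JVM, but at least all the threads of that task share it instead of each map task paying the full load cost.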
