Hello everyone!

I am working on a task in which every cluster node executing a Spark job needs
access to a large external file. The file is the MaxMind GeoIP database, around
15 megabytes in size. MaxMind's library keeps it open and reads from it with
random access. Of course, the file could simply be stored in HDFS, but
random-access reads from HDFS would be quite inefficient.
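For context, the access pattern looks roughly like this (a minimal sketch
assuming the GeoIP2 Java API; the database path and lookup IP are just
placeholders):

    import java.io.File
    import java.net.InetAddress
    import com.maxmind.geoip2.DatabaseReader

    // The reader seeks around inside the file on every lookup,
    // which is why it really wants a local copy of the database.
    val reader = new DatabaseReader.Builder(new File("/local/path/GeoLite2-City.mmdb")).build()
    val response = reader.city(InetAddress.getByName("128.101.101.101"))
    println(response.getCountry.getIsoCode)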

Hadoop MapReduce has the DistributedCache mechanism dedicated to this purpose:
we can specify files in HDFS that are required during job execution, and they
are copied to the worker nodes before the job starts, so tasks can efficiently
access the local copies.
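On the MapReduce side that looks roughly like this (a sketch; the HDFS path and
symlink name are placeholders):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job

    // Driver: register the file so the framework copies it to every
    // worker node before any task runs.
    val job = Job.getInstance(new Configuration())
    job.addCacheFile(new URI("hdfs:///data/GeoLite2-City.mmdb#geoip.mmdb"))

    // Task side: the localized file is reachable through the symlink
    // name given after '#', i.e. new java.io.File("geoip.mmdb").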

I haven't found a simple and effective way of doing the same thing in Spark. Is
there a preferred way to do it?

-- 
Best regards,
Konstantin Abakumov
