RE: Loading objects only once

JG Perrin Thu, 28 Sep 2017 04:45:30 -0700

Maybe load the model on each executor’s disk and load it from there? Depending 
on how you use the data/model, using something like Livy and sharing the same 
connection may help?

From: Naveen Swamy [mailto:mnnav...@gmail.com]
Sent: Wednesday, September 27, 2017 9:08 PM
To: user@spark.apache.org
Subject: Loading objects only once

Hello all,

I am a new user to Spark, please bear with me if this has been discussed 
earlier.

I am trying to run batch inference using DL frameworks pre-trained models and 
Spark. Basically, I want to download a model(which is usually ~500 MB) onto the 
workers and load the model and run inference on images fetched from the source 
like S3 something like this
rdd = sc.parallelize(load_from_s3)
rdd.map(fetch_from_s3).map(read_file).map(predict)

I was able to get it running in local mode on Jupyter, However, I would like to 
load the model only once and not every map operation. A setup hook would have 
nice which loads the model once into the JVM, I came across this JIRA 
https://issues.apache.org/jira/browse/SPARK-650  which suggests that I can use 
Singleton and static initialization. I tried to do this using a Singleton 
metaclass following the thread here 
https://stackoverflow.com/questions/6760685/creating-a-singleton-in-python. 
Following this failed miserably complaining that Spark cannot serialize ctype 
objects with pointer references.

After a lot of trial and error, I moved the code to a separate file by creating 
a static method for predict that checks if a class variable is set or not and 
loads the model if not set. This approach does not sound thread safe to me, So 
I wanted to reach out and see if there are established patterns on how to 
achieve something like this.

Also, I would like to understand the executor->tasks->python process mapping, 
Does each task gets mapped to a separate python process?  The reason I ask is I 
want to be to use mapPartition method to load a batch of files and run 
inference on them separately for which I need to load the object once per task. 
Any

Thanks for your time in answering my question.

Cheers, Naveen

RE: Loading objects only once

Reply via email to