Jonathan,
Did you ever get to the bottom of this? I have some users working with
Spark in a classroom setting, and our example notebooks spill so much
data to disk that they run out of quota. A 1.5G input set becomes >30G
of spilled data on disk. I looked into how I
could unpersist the data so I could clean up the files, but I was
unsuccessful.
We're using Spark 1.5.0.
Yours,
Ewan
On 16/07/15 23:18, Stahlman, Jonathan wrote:
Hello all,
I am running the Spark recommendation algorithm in MLlib and I have
been studying its output with various model configurations. Ideally I
would like to be able to run one job that trains the recommendation
model with many different configurations to try to optimize for
performance. Sample Python code is copied below.
The issue I have is that each new model which is trained caches a set
of RDDs and eventually the executors run out of memory. Is there any
way in PySpark to unpersist() these RDDs after each iteration? The
names of the RDDs, which I gathered from the UI, are:
itemInBlocks
itemOutBlocks
products
ratingBlocks
userInBlocks
userOutBlocks
users
I am using Spark 1.3. Thank you for any help!
Regards,
Jonathan
import itertools
from pyspark.mllib.recommendation import ALS, Rating

# data, rating and areaUnderCurve are defined elsewhere
data_train, data_cv, data_test = data.randomSplit([99, 1, 1], 2)
functions = [rating]
ranks = [10, 20]
iterations = [10, 20]
lambdas = [0.01, 0.1]
alphas = [1.0, 50.0]
results = []
for ratingFunction, rank, numIterations, m_lambda, m_alpha in itertools.product(
        functions, ranks, iterations, lambdas, alphas):
    # train model
    ratings_train = data_train.map(
        lambda l: Rating(l.user, l.product, ratingFunction(l)))
    model = ALS.trainImplicit(ratings_train, rank, numIterations,
                              lambda_=float(m_lambda), alpha=float(m_alpha))
    # test performance on CV data
    ratings_cv = data_cv.map(
        lambda l: Rating(l.user, l.product, ratingFunction(l)))
    auc = areaUnderCurve(ratings_cv, model.predictAll)
    # save results as one comma-separated line per configuration
    result = ",".join(str(x) for x in [ratingFunction.__name__, rank,
                                       numIterations, m_lambda, m_alpha, auc])
    results.append(result)
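
One possible way to release the cached RDDs between iterations is sketched below. It is a workaround under stated assumptions rather than a documented PySpark API: it goes through sc._jsc, PySpark's private handle to the JavaSparkContext, and asks the JVM side for its persistent RDDs. The helper name unpersist_cached_rdds and its names parameter are hypothetical, introduced only for illustration, and sc is assumed to be an existing SparkContext.

```python
def unpersist_cached_rdds(sc, names=None):
    """Unpersist RDDs that the JVM SparkContext reports as persistent.

    sc    -- a live pyspark SparkContext (assumed to exist already)
    names -- optional collection of RDD names to restrict to, for example
             the block names shown in the UI; None means every cached RDD
    Returns the ids of the RDDs that were unpersisted.
    """
    released = []
    # Assumption: sc._jsc is a private attribute, so this may break
    # between Spark versions.
    for rdd_id, java_rdd in sc._jsc.getPersistentRDDs().items():
        if names is None or java_rdd.name() in names:
            java_rdd.unpersist()
            released.append(rdd_id)
    return released
```

Inside the loop this could be called after auc has been computed, for instance as unpersist_cached_rdds(sc, names=["userInBlocks", "userOutBlocks", "itemInBlocks", "itemOutBlocks", "ratingBlocks"]). Passing names=None would also unpersist anything cached deliberately, such as the training data, so a name filter is probably the safer choice.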