Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

Ewan Higgs Mon, 07 Dec 2015 05:45:44 -0800

Jonathan,

Did you ever get to the bottom of this? I have some users working with Spark in a classroom setting and our example notebooks run into problems where there is so much spilled to disk that they run out of quota. A 1.5G input set becomes >30G of spilled data on disk. I looked into how I could unpersist the data so I could clean up the files, but I was unsuccessful.


We're using Spark 1.5.0

Yours,
Ewan

On 16/07/15 23:18, Stahlman, Jonathan wrote:

Hello all,
I am running the Spark recommendation algorithm in MLlib and I have been studying its output with various model configurations. Ideally I would like to be able to run one job that trains the recommendation model with many different configurations to try to optimize for performance. Sample code in Python is copied below.
The issue I have is that each new model which is trained caches a set of RDDs, and eventually the executors run out of memory. Is there any way in PySpark to unpersist() these RDDs after each iteration? The names of the RDDs, which I gathered from the UI, are:
itemInBlocks
itemOutBlocks
Products
ratingBlocks
userInBlocks
userOutBlocks
users

I am using Spark 1.3.  Thank you for any help!

Regards,
Jonathan




import itertools
from pyspark.mllib.recommendation import ALS, Rating

data_train, data_cv, data_test = data.randomSplit([99,1,1], 2)
functions = [rating] #defined elsewhere
ranks = [10,20]
iterations = [10,20]
lambdas = [0.01,0.1]
alphas  = [1.0,50.0]

results = []
for ratingFunction, rank, numIterations, m_lambda, m_alpha in itertools.product( functions, ranks, iterations, lambdas, alphas ):
    #train model
    ratings_train = data_train.map(lambda l: Rating( l.user, l.product, ratingFunction(l) ) )
    model = ALS.trainImplicit( ratings_train, rank, numIterations, lambda_=float(m_lambda), alpha=float(m_alpha) )
    #test performance on CV data
    ratings_cv = data_cv.map(lambda l: Rating( l.user, l.product, ratingFunction(l) ) )
    auc = areaUnderCurve( ratings_cv, model.predictAll ) #areaUnderCurve is not shown in this message

    #save results
    result = ",".join(str(l) for l in [ratingFunction.__name__, rank, numIterations, m_lambda, m_alpha, auc])
    results.append(result)
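
One possible workaround, sketched here under stated assumptions rather than as a confirmed fix, is to unpersist the cached RDDs by name at the end of each loop iteration. The helper below is plain Python: it takes any mapping of RDD id to an object that exposes name() and unpersist(), and releases the entries whose names match the list from the UI (case-insensitively). The names unpersist_by_name and ALS_RDD_NAMES are hypothetical. Obtaining the mapping in PySpark through sc._jsc.getPersistentRDDs() relies on a private attribute of the SparkContext and is an assumption; it has not been checked against Spark 1.3 or 1.5.

```python
def unpersist_by_name(persistent_rdds, names):
    """Unpersist every cached RDD whose name is in `names`.

    `persistent_rdds` is any mapping of RDD id -> object exposing
    name() and unpersist(). Matching ignores case. Returns the sorted
    list of ids that were unpersisted.
    """
    wanted = {n.lower() for n in names}
    released = []
    # Copy the items first so the mapping can change while we iterate.
    for rdd_id, rdd in list(persistent_rdds.items()):
        name = rdd.name()  # may be None for RDDs that were never named
        if name is not None and name.lower() in wanted:
            rdd.unpersist()
            released.append(rdd_id)
    return sorted(released)


# RDD names as reported from the UI earlier in this message.
ALS_RDD_NAMES = [
    "itemInBlocks", "itemOutBlocks", "Products", "ratingBlocks",
    "userInBlocks", "userOutBlocks", "users",
]

# Hypothetical use as the last statement of the training loop
# (assumption: `sc` is the active SparkContext and
# sc._jsc.getPersistentRDDs() is available in the Spark version in use):
#     unpersist_by_name(sc._jsc.getPersistentRDDs(), ALS_RDD_NAMES)
```

Calling it only after the AUC has been computed matters, because the list includes the model's own users and Products factors, which model.predictAll still needs.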

