I'm not sure if this is available in Python, but from 1.3 on you should
be able to call ALS.setFinalRDDStorageLevel with StorageLevel.NONE to ask
it to unpersist its RDDs when it is done.
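
In case that setter isn't exposed through the Python API, one possible
stopgap is to reach through to the JVM and unpersist whatever is still
cached between training runs. This is only a sketch: sc._jsc and
JavaSparkContext.getPersistentRDDs() are internal interfaces, not public
API, so this may break across versions.

  def unpersist_all(sc):
      # Ask the JVM-side SparkContext for every RDD it reports as
      # persisted, and unpersist each one. Use with care on a shared
      # SparkContext, since this drops everything, not just ALS's RDDs.
      for rdd in sc._jsc.getPersistentRDDs().values():
          rdd.unpersist()

Calling something like this after each model is trained and evaluated
should clear the itemInBlocks/userInBlocks-style RDDs, assuming nothing
else in the job still needs them.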

On Mon, Dec 7, 2015 at 1:42 PM, Ewan Higgs <ewan.hi...@ugent.be> wrote:
> Jonathan,
> Did you ever get to the bottom of this? I have some users working with Spark
> in a classroom setting, and our example notebooks run into problems where so
> much data is spilled to disk that the students run out of quota. A 1.5G input
> set becomes >30G of spilled data on disk. I looked into how I could unpersist
> the data so that I could clean up the files, but I was unsuccessful.
>
> We're using Spark 1.5.0.
>
> Yours,
> Ewan
>
> On 16/07/15 23:18, Stahlman, Jonathan wrote:
>
> Hello all,
>
> I am running the Spark recommendation algorithm in MLlib, and I have been
> studying its output under various model configurations.  Ideally I would like
> to run one job that trains the recommendation model with many different
> configurations, to try to optimize performance.  Sample code in Python is
> copied below.
>
> The issue I have is that each new model that is trained caches a set of
> RDDs, and eventually the executors run out of memory.  Is there any way in
> PySpark to unpersist() these RDDs after each iteration?  The names of the
> RDDs, as gathered from the UI, are:
>
> itemInBlocks
> itemOutBlocks
> products
> ratingBlocks
> userInBlocks
> userOutBlocks
> users
>
> I am using Spark 1.3.  Thank you for any help!
>
> Regards,
> Jonathan
>
>   import itertools
>   from pyspark.mllib.recommendation import ALS, Rating
>
>   # data is an RDD of records with user/product fields, defined elsewhere
>   data_train, data_cv, data_test = data.randomSplit([99, 1, 1], 2)
>   functions = [rating]  # defined elsewhere
>   ranks = [10, 20]
>   iterations = [10, 20]
>   lambdas = [0.01, 0.1]
>   alphas = [1.0, 50.0]
>
>   results = []
>   for ratingFunction, rank, numIterations, m_lambda, m_alpha in \
>           itertools.product(functions, ranks, iterations, lambdas, alphas):
>     # train model
>     ratings_train = data_train.map(
>         lambda l: Rating(l.user, l.product, ratingFunction(l)))
>     model = ALS.trainImplicit(ratings_train, rank, numIterations,
>                               lambda_=float(m_lambda), alpha=float(m_alpha))
>
>     # test performance on CV data
>     ratings_cv = data_cv.map(
>         lambda l: Rating(l.user, l.product, ratingFunction(l)))
>     auc = areaUnderCurve(ratings_cv, model.predictAll)  # defined elsewhere
>
>     # save results
>     result = ",".join(str(x) for x in
>                       [ratingFunction.__name__, rank, numIterations,
>                        m_lambda, m_alpha, auc])
>     results.append(result)
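>
>   Ideally I could do something like the following at the end of each loop
>   iteration (hypothetical calls; I have not found these exposed in PySpark
>   1.3, and they would only cover the final factor RDDs, not the *InBlocks
>   ones):
>
>     model.userFeatures().unpersist()     # as in the Scala API, if exposed
>     model.productFeatures().unpersist()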
>
