Question was cross-posted

Begin forwarded message:

From: Pat Ferrel <[email protected]>
Subject: Re: PIO UR model issue in training
Date: July 7, 2017 at 12:35:45 PM PDT
To: [email protected]
Cc: actionml-user <[email protected]>

Your memory size and cores seem far too low for the data size. Spark gets its 
speed from keeping all data, including intermediate results, in memory 
somewhere in the cluster while it is being used. The driver memory should be 
nearly the same size as the executor memory (it can be somewhat less).

I have never seen a case where the physical architecture was not heavily 
influenced by the data size, and yours is large. We have deployments that 
ingest that much data every day and train in about 1.5 hours, so the system 
must be scaled correctly.
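
As a rough illustration only (these figures are placeholders, not a sizing 
recommendation for your data), a submit that keeps driver memory close to 
executor memory and gives the executors more headroom might look something 
like this:

# Illustrative figures only: size executors to your data and keep driver
# memory close to executor memory. Values and the log path are placeholders.
nohup pio train -- --master yarn \
  --driver-memory 32G \
  --executor-memory 32G --executor-cores 4 --num-executors 16 \
  >/data/logs/flexible-ur/train.log 2>&1 &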

Also, if you are using YARN, is this a shared cluster? We do not recommend 
that, since other jobs may be allocated resources that affect your execution 
time. Sharing an analytics cluster with something that is a business 
requirement (recommendations) can be problematic. We tend to favor spinning up 
a dedicated Spark cluster with as many nodes as you need (we have tools to do 
this on AWS) and stopping it after training is done, so you don't pay anything 
for it when it is not in use. With this setup, training times become quite 
predictable and can be made arbitrarily short at minimal cost.
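
Just as a generic sketch (this is not the ActionML tooling mentioned above, 
and the instance type, count, and release label are placeholders), a dedicated 
Spark cluster on AWS can be created for a training run and terminated 
afterwards with the EMR CLI:

# Generic sketch only: spin up a dedicated Spark cluster for training,
# then terminate it so it costs nothing while idle.
aws emr create-cluster \
  --name ur-training \
  --release-label emr-5.7.0 \
  --applications Name=Spark \
  --instance-type m4.2xlarge \
  --instance-count 8 \
  --use-default-roles

# ...run pio train against this cluster, then shut it down:
aws emr terminate-clusters --cluster-ids <cluster-id-from-create-cluster>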


On Jul 7, 2017, at 11:51 AM, [email protected] wrote:

I am running into issues training the Universal Recommender model.

The UR model reads 3 kinds of events (purchase, view, and atb) from the TSV 
file and creates 3 RDDs. The following are the RDD sizes:

2017-07-06 17:02:44,797 INFO  com.macys.ur.flexible.DataSource [main] - 
Partitions after zip end of DataSource: 32
2017-07-06 17:02:44,798 INFO  com.macys.ur.flexible.DataSource [main] - 
Received events List(purchase, view, atb)
2017-07-06 17:03:52,773 INFO  com.macys.ur.flexible.DataSource [main] - Number 
of events List(68032180, 7551743, 196947013)

Before dumping into Elasticsearch, the job seems stuck and does not finish, 
taking 23 hours or more.

I have been playing with the driver memory and the number of executors, but 
nothing is helping.

Command to submit the job:
nohup pio train -- --master yarn --conf="spark.driver.memory=32G" \
  --executor-memory 12G --executor-cores 2 --num-executors 16 \
  >/data/logs/flexible-ur/train2017-07-07-small.log 2>&1 &

Please provide some guidance on the issue.
