What is the physical architecture? Do you have HBase, Elasticsearch, and Spark 
running on separate machines? If the CPU load is low then the job must be IO bound, 
reading from HBase or writing to Elasticsearch. Do you have any input event 
load yet, or are you only making queries? These all change the equation and are 
why running the services on separate machines makes the most sense.


On May 10, 2017, at 1:47 PM, Bolmo Joosten <[email protected]> wrote:

Thanks for your suggestion. I forgot to mention in my last email that the 
$plus$plus stage takes most of the time (95%+) and uses only 1-3 CPUs.

I will give it a try with lower driver memory and higher executor memory.

Maybe a hard question, but do you have any idea what kind of training time I 
should expect with this data size on this cluster?

We modified the default UR template to create the eventRDDs from CSV files 
instead of HBase. HBase was unable to process this amount of data on the 
cluster. This means we can't provide any personalized recommendations, but that 
is OK for now.

2017-05-10 10:22 GMT-07:00 Pat Ferrel <[email protected]>:
You can’t bypass HBase; you can import JSON into HBase directly, so I assume 
that is what you are saying.

Executor memory should be higher and driver memory lower. Spark loves memory, 
and in this case the lower limit is all of your input events plus the BiMaps for 
all user and item ids. If you don’t hit an OOM you are above the minimum, but 
increasing executor memory might help, as might more executor cores. The lower 
limit for driver memory is roughly equal to the amount per executor.
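
For reference, the BiMaps mentioned here are bidirectional id <-> integer-index 
dictionaries kept for every distinct user and item id. A minimal sketch of the 
idea (purely illustrative; the class name and API below are made up, not the 
template’s actual BiMap) shows why they set a memory floor: both directions live 
on the heap at once.

    // Illustrative sketch of a bidirectional id <-> index dictionary; this is
    // not the template's actual BiMap class.
    class SimpleBiMap[K, V](forward: Map[K, V]) {
      // The reverse index is built eagerly, so the cost is roughly 2x the id set.
      val inverse: Map[V, K] = forward.map { case (k, v) => (v, k) }
      def get(key: K): Option[V] = forward.get(key)
      def getInverse(value: V): Option[K] = inverse.get(value)
      def size: Int = forward.size
    }

    // Example: external user ids mapped to the integer indices used internally.
    val userIds = new SimpleBiMap(Map("user-1" -> 0, "user-2" -> 1))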

One unfortunate thing about Spark is that you can scale it to do the job in 
minutes, but when you go to read from or write to HBase or Elasticsearch, a 
cluster that large will overload the DBs. So a long training time is not all 
that bad a thing, since the cluster is probably not overloading the IO.


On May 10, 2017, at 8:45 AM, Bolmo Joosten <[email protected]> wrote:

Hi all,

I am having trouble scaling the Universal Recommender to a dataset with 250M 
events (purchase, view, atb). It trains fine on a couple of million events, but 
training time becomes very long (>48h) on the full dataset.

Hardware specs:

- Standalone cluster
- 20 cores (40 with hyperthreading)
- 264 GB RAM

Input data size and format:

- We load directly from CSV files and bypass HBase; the CSV data is 19 GB.
- Equivalent size in PIO JSON format: 150 GB

Train command:

pio train -- --driver-memory 64G --executor-memory 8G --executor-cores 2

I have tried various combinations of driver memory, executor memory, and number 
of cores, but the training time does not seem to be affected by this.

The Spark UI tells me the save method (collect > $plus$plus) in URModel.scala 
takes a very long time. See the attached dumps of the Spark UI for details.
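
For reference, $plus$plus is just the compiled name of Scala's ++ operator, 
i.e. RDD union. A rough, hypothetical sketch of a save step with this shape 
(not the actual URModel.scala code) shows why such a stage can run for hours 
while only a few CPUs are busy:

    import org.apache.spark.rdd.RDD

    // Hypothetical sketch, not the real URModel code: several RDDs of per-item
    // documents are concatenated with ++ (RDD.union) and then collect()ed,
    // pulling the whole model through the driver before it is written out. The
    // driver-side assembly is effectively single-threaded, which is consistent
    // with a stage that uses only 1-3 CPUs.
    def saveModelSketch(
        correlators: Seq[RDD[(String, Map[String, Seq[String]])]]
    ): Array[(String, Map[String, Seq[String]])] = {
      val allDocs = correlators.reduce(_ ++ _) // shows up as $plus$plus in the Spark UI
      allDocs.collect()                        // everything funnels into driver memory
    }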

Any suggestions?

Thanks, Bolmo