Hi,

While building a recommendation engine with Spark MLlib (ALS), we are facing some
issues during execution.

Details are below.

We are trying to train our model on 1.4 million sparse rating records
(100,000 customers x 50,000 items). The execution DAG takes a long time,
and the job crashes after several hours while executing the
model.recommendProductsForUsers() step. The causes of the exceptions are
non-uniform and vary from run to run.
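In case it helps, below is a minimal sketch of what our job does. The input path, rating file format, ALS hyper-parameters, and class name are simplified placeholders rather than our exact code:

       import org.apache.spark.SparkConf;
       import org.apache.spark.api.java.JavaRDD;
       import org.apache.spark.api.java.JavaSparkContext;
       import org.apache.spark.mllib.recommendation.ALS;
       import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
       import org.apache.spark.mllib.recommendation.Rating;

       public class RecommendJob {
           public static void main(String[] args) {
               SparkConf sparkConf = new SparkConf().setAppName("ALSRecommender");
               JavaSparkContext sc = new JavaSparkContext(sparkConf);

               // Each input line: customerId,itemId,rating (placeholder path and format)
               JavaRDD<Rating> ratings = sc.textFile("hdfs:///data/ratings.csv").map(line -> {
                   String[] p = line.split(",");
                   return new Rating(Integer.parseInt(p[0]), Integer.parseInt(p[1]),
                                     Double.parseDouble(p[2]));
               });

               // Placeholder hyper-parameters: rank, iterations, lambda
               MatrixFactorizationModel model = ALS.train(ratings.rdd(), 10, 10, 0.01);

               // This is the step that crashes after several hours:
               model.recommendProductsForUsers(10)
                    .saveAsTextFile("hdfs:///output/recommendations");

               sc.stop();
           }
       }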

The common exceptions faced during the last 10 runs are:

a)      Akka timeout

b)      Out-of-memory exceptions

c)      Executor disassociation

We have tried increasing the relevant timeouts to 1200 seconds, but that does
not seem to have any impact:
       sparkConf.set("spark.network.timeout", "1200s");
       sparkConf.set("spark.rpc.askTimeout", "1200s");
       sparkConf.set("spark.rpc.lookupTimeout", "1200s");
       sparkConf.set("spark.akka.timeout", "1200s");

Our command-line parameters are as follows: --num-executors 5
--executor-memory 2G --conf spark.yarn.executor.memoryOverhead=600 --conf
spark.default.parallelism=500 --master yarn
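
For completeness, the full launch command looks like this (the driver class
and application jar names are placeholders):

       spark-submit --master yarn \
           --num-executors 5 \
           --executor-memory 2G \
           --conf spark.yarn.executor.memoryOverhead=600 \
           --conf spark.default.parallelism=500 \
           --class com.example.RecommendJob \
           recommender.jar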

Configuration:

1.       3-node cluster, 16 GB RAM, Intel i7 processors.

2.       Spark 1.5.2

The algorithm works perfectly for smaller numbers of records.

We would appreciate any help in this regard and would like to know the following:

1.       How can we handle execution over large numbers of records in Spark
without failures, given that the rating records will grow over time?

2.       Are we missing any command-line parameters that are necessary for this
type of heavy execution?

3.       Is the above cluster size and configuration adequate for processing
this many records? A long execution time is acceptable, but the process should
not fail.

4.       What exactly does the Akka timeout error mean during ALS job execution?

Regards,
Pankaj Rawat
