Hi,

While building a recommendation engine using Spark MLlib (ALS), we are facing some issues during execution; details are below.

We are trying to train our model on 1.4 million sparse rating records (100,000 customers x 50,000 items). The execution DAG takes a long time to complete, and the job crashes after several hours at the model.recommendProductsForUsers() step.
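For reference, here is a minimal sketch of the kind of pipeline we are running (the input path and the ALS hyperparameters shown here are illustrative placeholders, not our exact values):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.recommendation.ALS;
    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
    import org.apache.spark.mllib.recommendation.Rating;

    public class RecommendJob {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("ALSRecommender");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // ~1.4 million Rating(user, product, rating) records parsed from input
            // (input path is illustrative)
            JavaRDD<Rating> ratings = sc.textFile("hdfs:///data/ratings.csv")
                .map(line -> {
                    String[] f = line.split(",");
                    return new Rating(Integer.parseInt(f[0]),
                                      Integer.parseInt(f[1]),
                                      Double.parseDouble(f[2]));
                });

            // Illustrative hyperparameters: rank = 10, 10 iterations, lambda = 0.01
            MatrixFactorizationModel model = ALS.train(ratings.rdd(), 10, 10, 0.01);

            // This is the step that fails on the full data set:
            // top-10 recommendations for every user
            model.recommendProductsForUsers(10).count();

            sc.stop();
        }
    }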
The causes of the exceptions are non-uniform and vary from run to run. The most common exceptions across the last 10 runs were:

a) Akka timeout
b) Out-of-memory exceptions
c) Executor disassociation

We have tried increasing the timeouts to 1200 seconds, but that does not seem to have any impact:

    sparkConf.set("spark.network.timeout", "1200s");
    sparkConf.set("spark.rpc.askTimeout", "1200s");
    sparkConf.set("spark.rpc.lookupTimeout", "1200s");
    sparkConf.set("spark.akka.timeout", "1200s");

Our command-line parameters are as follows:

    --num-executors 5 --executor-memory 2G --conf spark.yarn.executor.memoryOverhead=600 --conf spark.default.parallelism=500 --master yarn

Configuration:
1. 3-node cluster, 16 GB RAM, Intel i7 processor
2. Spark 1.5.2

The algorithm works perfectly for smaller numbers of records. We would appreciate any help in this regard, and would like to know the following:

1. How can we handle execution over large record counts in Spark without failures, given that the number of rating records will grow over time?
2. Are we missing any command-line parameters that are necessary for this type of heavy execution?
3. Are the above cluster size and configuration adequate for processing this many records? Taking a large amount of time during execution is fine, but the process should not fail.
4. What exactly does the Akka timeout error mean during ALS job execution?

Regards,
Pankaj Rawat