Re: Execution error during ALS execution in spark

2016-04-01 Thread pankajrawat
Thanks for the suggestion, but our application is still crashing.

*Description: * flatMap at MatrixFactorizationModel.scala:278

*Failure Reason: * Job aborted due to stage failure: Task 1 in stage 6.0
failed 4 times, most recent failure: Lost task 1.3 in stage 6.0 (TID 116,
dev.local): ExecutorLostFailure (executor 11 lost)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Execution-error-during-ALS-execution-in-spark-tp26644p26659.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Execution error during ALS execution in spark

2016-03-31 Thread pankajrawat
Hi, 

While building a recommendation engine using Spark MLlib (ALS), we are facing
some issues during execution.

Details are below. 

We are trying to train our model on 1.4 million sparse rating records (100,000
customers × 50,000 items). The execution DAG takes a long time and the job
crashes after several hours at the model.recommendProductsForUsers() step.
The cause of the failure is non-uniform and varies from run to run.
The common exceptions seen across the last 10 runs are:
a)  Akka timeout
b)  out-of-memory exceptions
c)  executor disassociation
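For context, the pipeline we are running is roughly the following sketch (the
input path, checkpoint directory, hyperparameters, and output path are
illustrative placeholders, not our exact code):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsPipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ALS sketch"))

    // Checkpointing truncates ALS's per-iteration RDD lineage, which
    // otherwise keeps growing and can contribute to executor memory issues.
    sc.setCheckpointDir("hdfs:///tmp/als-checkpoint") // placeholder path

    // Parse "user,product,rating" lines into MLlib Rating objects.
    val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
      val Array(u, p, r) = line.split(',')
      Rating(u.toInt, p.toInt, r.toDouble)
    }

    val model = new ALS()
      .setRank(10)              // placeholder hyperparameters
      .setIterations(10)
      .setLambda(0.01)
      .setCheckpointInterval(2) // checkpoint factor RDDs every 2 iterations
      .run(ratings)

    // The step where our runs fail: computes top-10 recommendations for
    // every user via a large blocked cross-product of the factor matrices.
    val recs = model.recommendProductsForUsers(10)
    recs.saveAsObjectFile("hdfs:///data/recommendations") // placeholder path
  }
}
```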

We have tried increasing the relevant timeouts to 1200 seconds, but that does
not seem to have any impact:
   sparkConf.set("spark.network.timeout", "1200s");
   sparkConf.set("spark.rpc.askTimeout", "1200s");
   sparkConf.set("spark.rpc.lookupTimeout", "1200s");
   sparkConf.set("spark.akka.timeout", "1200s"); 

Our command-line parameters are as follows:
   --num-executors 5 --executor-memory 2G
   --conf spark.yarn.executor.memoryOverhead=600
   --conf spark.default.parallelism=500 --master yarn
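Put together, the full launch command looks roughly like this (the driver
class and jar names below are placeholders, not our actual artifact names):

```shell
# --class and the jar name are placeholders for illustration only
spark-submit \
  --master yarn \
  --num-executors 5 \
  --executor-memory 2G \
  --conf spark.yarn.executor.memoryOverhead=600 \
  --conf spark.default.parallelism=500 \
  --class com.example.RecommendationJob \
  recommendation-engine.jar
```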

Configuration
1.  3-node cluster, 16 GB RAM, Intel i7 processors.
2.  Spark 1.5.2

The algorithm works perfectly for a smaller number of records.

We would appreciate any help in this regard and would like to know the following:
1.  How can we reliably process large numbers of records in Spark, given that
the rating records will grow over time?
2.  Are we missing any command-line parameters that are necessary for this
kind of heavy job?
3.  Is the above cluster size and configuration adequate for processing this
many records? A long execution time is acceptable, but the process should not
fail.
4.  What exactly does the Akka timeout error mean during ALS job execution?

Regards,
Pankaj Rawat




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Execution-error-during-ALS-execution-in-spark-tp26644.html
