Re: MLlib Spam example gets stuck in Stage X

2015-03-30 Thread Su She
Thank you for updating the files, Holden! I was actually using that same text in my files located on HDFS. Could the files being located on HDFS be the reason why the example gets stuck? I copied and pasted the code provided on GitHub; the only things I changed were: a) file paths to: val spam =

Re: MLlib Spam example gets stuck in Stage X

2015-03-20 Thread Su She
Hello Xiangrui, I use Spark 1.2.0 on CDH 5.3. Thanks! -Su On Fri, Mar 20, 2015 at 2:27 PM Xiangrui Meng men...@gmail.com wrote: Su, which Spark version did you use? -Xiangrui On Thu, Mar 19, 2015 at 3:49 AM, Akhil Das ak...@sigmoidanalytics.com wrote: To get these metrics out, you need

Re: MLlib Spam example gets stuck in Stage X

2015-03-20 Thread Xiangrui Meng
Su, which Spark version did you use? -Xiangrui On Thu, Mar 19, 2015 at 3:49 AM, Akhil Das ak...@sigmoidanalytics.com wrote: To get these metrics out, you need to open the driver UI running on port 4040. There you will see the Stages information, and for each stage you can see how much time

MLlib Spam example gets stuck in Stage X

2015-03-19 Thread Su She
Hello Everyone, I am trying to run this MLlib example from Learning Spark: https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48 Things I'm doing differently: 1) Using spark shell instead of an application 2) instead of

Re: MLlib Spam example gets stuck in Stage X

2015-03-19 Thread Akhil Das
Can you see where exactly it is spending time? Since you said it gets to Stage 2, you should be able to see how much time it spent on Stage 1. If it is GC time, try increasing the level of parallelism, or repartition the data to something like sc.defaultParallelism * 3. Thanks Best Regards On Thu, Mar

Re: MLlib Spam example gets stuck in Stage X

2015-03-19 Thread Su She
Hi Akhil, 1) How can I see how much time it is spending on Stage 1? And what if, like above, it doesn't get past Stage 1? 2) How can I check whether it is GC time, and where would I increase the parallelism for the model? I have a Spark Master and 2 Workers running on CDH 5.3...what would the

Re: MLlib Spam example gets stuck in Stage X

2015-03-19 Thread Akhil Das
To get these metrics out, you need to open the driver UI running on port 4040. There you will see the Stages information, and for each stage you can see how much time it is spending on GC etc. In your case, the parallelism seems to be 4; the higher the parallelism, the more tasks you will see.
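
The repartitioning suggested in this thread (three times the default parallelism) could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the thread or from the Learning Spark example: the object name RepartitionSketch, the helper widen, the local[2] master and the sample strings are all made up. In Scala the default parallelism is read from the property sc.defaultParallelism; in spark-shell, sc already exists and only the repartition call is needed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RepartitionSketch {
  // Hypothetical helper: spread an RDD over `factor` times the default parallelism.
  def widen[T](rdd: RDD[T], factor: Int): RDD[T] =
    rdd.repartition(rdd.sparkContext.defaultParallelism * factor)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with 2 threads, for illustration only.
    val conf = new SparkConf().setAppName("RepartitionSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up stand-in for the text files loaded in the real example.
      val lines = sc.parallelize(Seq("cheap pills now", "meeting at noon", "win money fast"), 1)
      val widened = widen(lines, 3)
      // More partitions means more tasks per stage in the driver UI.
      println(s"partitions before: ${lines.partitions.length}, after: ${widened.partitions.length}")
    } finally {
      sc.stop()
    }
  }
}
```

After repartitioning, the Stages page of the driver UI should list correspondingly more tasks for the stages that read the widened RDD, which makes it easier to see whether individual tasks are dominated by GC time.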