Thank you! Here is some output I get (yes, I am using an early version of GraphX)...
Stage Id  Description                                          Submitted            Duration  Tasks: Succeeded/Total  Shuffle Read  Shuffle Write
25        take at PregelNoOptionWrapperLoadFromFile.scala:108  2013/11/11 23:05:48  34 ms     1/16
15        foreach at GraphImpl.scala:66                        2013/11/11 23:05:41  6.9 s     16/16                   1582.9 MB
23        map at GraphImpl.scala:231                           2013/11/11 23:05:14  27.2 s    16/16                   1920.8 MB     778.1 MB
24        flatMap at GraphImpl.scala:308                       2013/11/11 23:04:56  18.0 s    16/16                   1746.1 MB     1060.3 MB
21        map at GraphImpl.scala:231                           2013/11/11 23:04:26  30.2 s    16/16                   2.1 GB        869.8 MB
22        flatMap at GraphImpl.scala:308                       2013/11/11 23:04:08  17.6 s    16/16                   1746.1 MB     1142.0 MB
19        map at GraphImpl.scala:231                           2013/11/11 23:03:39  28.5 s    16/16                   1746.1 MB     883.8 MB
20        flatMap at GraphImpl.scala:308                       2013/11/11 23:03:22  17.9 s    16/16                   2.3 GB        950.5 MB
17        map at GraphImpl.scala:231                           2013/11/11 23:02:54  27.3 s    16/16                   979.2 MB      750.1 MB
18        flatMap at GraphImpl.scala:308                       2013/11/11 23:02:41  13.6 s    16/16                                 350.6 MB

One thing I am quite curious about: the Shuffle Read numbers seem to be much larger than the Shuffle Write numbers (by a factor of 2-3). I am wondering if this is normal? I have checkpointed some RDDs in the middle; could that result in reading directly from the checkpointed RDDs? Thank you!
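For context, the checkpointing I mention is set up roughly like this (a minimal sketch against the Spark API of this era; the checkpoint directory, input file, and RDD names are made-up examples, not my real job):

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[4]", "CheckpointSketch")

// Hypothetical path for illustration; in my real job this is on shared storage.
sc.setCheckpointDir("/tmp/spark-checkpoints")

val edges = sc.textFile("edges.txt").map(_.split("\t"))

// Mark the RDD for checkpointing; it is materialized to the checkpoint
// directory the next time an action runs, truncating its lineage.
edges.checkpoint()
edges.count()  // forces the checkpoint to be written
```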
Best,
Wenlei

On Mon, Nov 11, 2013 at 9:50 AM, Matei Zaharia <[email protected]> wrote:

> Yes, just look at the application UI on http://<driver-node>:4040
>
> Matei
>
> On Nov 11, 2013, at 12:26 AM, Wenlei Xie <[email protected]> wrote:
>
> > Hi,
> >
> > I have a shuffling task which is supposed to have many repeated values, so I assumed that shuffle compression would help performance.
> >
> > However, I get very similar running times whether I set spark.shuffle.compress to true or false. I would like to know whether this is because my data cannot be compressed. Is there any way to monitor the data shuffled for one transformation?
> >
> > Best,
> > Wenlei

--
Wenlei Xie (谢文磊)
Department of Computer Science
5132 Upson Hall, Cornell University
Ithaca, NY 14853, USA
Phone: (607) 255-5577
Email: [email protected]
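P.S. For completeness, this is how I was toggling the compression setting (a sketch only; in this early Spark version configuration goes through Java system properties, and the flag must be set before the SparkContext is constructed or it has no effect):

```scala
// Must be set BEFORE the SparkContext is created; setting it afterwards
// is silently ignored by the shuffle machinery.
System.setProperty("spark.shuffle.compress", "false")  // or "true"

val sc = new org.apache.spark.SparkContext("local[4]", "CompressTest")
```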
