Thank you! Here is some of the output I get (yes, I am using an early version
of GraphX)...

Stage Id  Description                                          Submitted            Duration  Tasks: Succeeded/Total  Shuffle Read  Shuffle Write
25        take at PregelNoOptionWrapperLoadFromFile.scala:108  2013/11/11 23:05:48  34 ms     1/16
15        foreach at GraphImpl.scala:66                        2013/11/11 23:05:41  6.9 s     16/16                   1582.9 MB
23        map at GraphImpl.scala:231                           2013/11/11 23:05:14  27.2 s    16/16                   1920.8 MB     778.1 MB
24        flatMap at GraphImpl.scala:308                       2013/11/11 23:04:56  18.0 s    16/16                   1746.1 MB     1060.3 MB
21        map at GraphImpl.scala:231                           2013/11/11 23:04:26  30.2 s    16/16                   2.1 GB        869.8 MB
22        flatMap at GraphImpl.scala:308                       2013/11/11 23:04:08  17.6 s    16/16                   1746.1 MB     1142.0 MB
19        map at GraphImpl.scala:231                           2013/11/11 23:03:39  28.5 s    16/16                   1746.1 MB     883.8 MB
20        flatMap at GraphImpl.scala:308                       2013/11/11 23:03:22  17.9 s    16/16                   2.3 GB        950.5 MB
17        map at GraphImpl.scala:231                           2013/11/11 23:02:54  27.3 s    16/16                   979.2 MB      750.1 MB
18        flatMap at GraphImpl.scala:308                       2013/11/11 23:02:41  13.6 s    16/16                                 350.6 MB

(In the UI, each description is linked to
http://idp33.almaden.ibm.com:4751/stages/stage?id=<Stage Id>.)


One thing I am quite curious about: the shuffle read sizes seem to be much
larger than the shuffle write sizes (by a factor of 2-3). Is this normal?
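To make the comparison concrete, here is a small Python sketch that pairs each stage's shuffle read with the shuffle write of the stage submitted just before it. The pairing is my assumption about the stage order of the Pregel loop (based on the Submitted timestamps); the sizes are copied from the table above, with 1 GB taken as 1024 MB as in the Spark UI.

```python
# (reading stage, shuffle read MB, upstream stage, upstream shuffle write MB)
# Pairing is an assumption based on submission order; sizes are from the UI table.
pairs_mb = [
    (17, 979.2,      18, 350.6),
    (20, 2.3 * 1024, 17, 750.1),
    (19, 1746.1,     20, 950.5),
    (22, 1746.1,     19, 883.8),
    (21, 2.1 * 1024, 22, 1142.0),
    (24, 1746.1,     21, 869.8),
    (23, 1920.8,     24, 1060.3),
    (15, 1582.9,     23, 778.1),
]

for stage, read, up, write in pairs_mb:
    # Ratio of a stage's shuffle read to its (assumed) upstream stage's write
    print(f"stage {stage} read / stage {up} write = {read / write:.2f}")
```

Under this pairing, the printed ratios fall roughly between 1.8 and 3.2, which matches the factor of 2-3 I am seeing.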

I have checkpointed some RDDs in the middle; would this result in stages
reading directly from the checkpointed RDDs?

Thank you!

Best,
Wenlei


On Mon, Nov 11, 2013 at 9:50 AM, Matei Zaharia <[email protected]> wrote:

> Yes, just look at the application UI on http://<driver-node>:4040
>
> Matei
>
> On Nov 11, 2013, at 12:26 AM, Wenlei Xie <[email protected]> wrote:
>
> > Hi,
> >
> > I have a shuffle-heavy task whose data should contain many repeated
> > values, so I assumed that shuffle compression would help performance.
> >
> > However, I get very similar running times whether I set
> > spark.shuffle.compress to true or false. I would like to know whether
> > this is because my data cannot be compressed. Is there any way to
> > monitor the amount of data shuffled for one transformation?
> >
> > Best,
> > Wenlei
>
>
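On the compression question in the quoted message: one quick offline way to gauge whether data with many repeated values would compress well is to serialize a sample and compare it against its zlib-compressed size. This is only a rough sketch with made-up sample records; the real shuffle records and Spark's own codec will behave differently.

```python
import pickle
import zlib

# Hypothetical sample of shuffle records with heavy repetition; substitute
# a real sample of the records your job shuffles.
records = [("vertex", i % 100, "state") for i in range(100_000)]

raw = pickle.dumps(records)
# Level 1 = fastest compression, roughly in the spirit of shuffle codecs.
compressed = zlib.compress(raw, level=1)

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
print(f"ratio: {len(raw) / len(compressed):.1f}x")
```

If the ratio comes out close to 1x on a real sample, the data simply does not compress, which would explain why spark.shuffle.compress makes no difference.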


-- 
Wenlei Xie (谢文磊)

Department of Computer Science
5132 Upson Hall, Cornell University
Ithaca, NY 14853, USA
Phone: (607) 255-5577
Email: [email protected]
