Hi All,

I have the same issue with one compressed .tgz file of around 3 GB. Increasing the number of nodes has no effect on performance.
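If it helps, here is a minimal Scala sketch of what I suspect is happening (an assumption on my part, not confirmed in this thread): a gzip-compressed file is not splittable, so Spark reads it into a single partition no matter how many nodes are available, and an explicit repartition is needed before the expensive work starts. The path and partition count are placeholders, and a .tgz also has a tar layer that sc.textFile does not unpack, so the sketch assumes a plain gzip-compressed text file:

import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RepartitionSketch"))

    // A gzip-compressed text file comes back as a single partition,
    // so only one task (one core on one node) does the initial work.
    val raw = sc.textFile("path/to/input.txt.gz")   // placeholder path
    println(s"partitions after read: ${raw.getNumPartitions}")

    // Spread the records across the cluster before the expensive part.
    val spread = raw.repartition(sc.defaultParallelism * 2)   // placeholder factor
    println(s"partitions after repartition: ${spread.getNumPartitions}")

    sc.stop()
  }
}

Printing getNumPartitions before and after the repartition is a quick way to confirm whether the read is the bottleneck.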
Best Regards,
Mostafa Alaa Mohamed
Technical Expert Big Data
M: +971506450787
Email: mohamedamost...@etisalat.ae

From: Lydia Ickler [mailto:ickle...@googlemail.com]
Sent: Friday, December 16, 2016 02:04 AM
To: user@spark.apache.org
Subject: PowerIterationClustering Benchmark

Hi all,

I have a question regarding the PowerIterationClusteringExample. I have adjusted the code so that it reads a file via sc.textFile("path/to/input"), which works fine.

Now I want to benchmark the algorithm with different numbers of nodes to see how well the implementation scales. As a testbed I have up to 32 nodes available, each with 16 cores, running Spark 2.0.2 on YARN.

For my smallest input data set (16 MB) the runtime does not really change whether I use 1, 2, 4, 8, 16, or 32 nodes (always ~1.5 minutes). The behavior is the same for my largest data set (2.3 GB): the runtime stays around 1 hour whether I use 16 or 32 nodes. I was expecting the runtime to shrink when I, for example, double the number of nodes.

For setting up my cluster environment I tried various suggestions from this paper: https://hal.inria.fr/hal-01347638v1/document

Has someone experienced the same? Or does someone have suggestions about what might have gone wrong?

Thanks in advance!
Lydia
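For reference, a minimal Scala sketch of the kind of adjusted example Lydia describes, assuming the input file holds one "srcId dstId similarity" triple per line (her actual file format is not shown in this thread); the path, k, and iteration count below are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.PowerIterationClustering

object PICBenchmarkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PICBenchmarkSketch"))

    // Assumed format: "srcId dstId similarity" per line, whitespace-separated.
    val similarities = sc
      .textFile("path/to/input", minPartitions = sc.defaultParallelism)
      .map { line =>
        val Array(src, dst, sim) = line.split("\\s+")
        (src.toLong, dst.toLong, sim.toDouble)
      }
      .cache()

    val model = new PowerIterationClustering()
      .setK(2)               // placeholder number of clusters
      .setMaxIterations(20)  // placeholder iteration count
      .run(similarities)

    model.assignments.take(10).foreach(a => println(s"${a.id} -> ${a.cluster}"))

    sc.stop()
  }
}

Checking similarities.getNumPartitions right after the read is a quick way to see whether the job can actually spread across more than a handful of tasks when nodes are added.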