All, I'm using the Spark shell to interact with a small test deployment of Spark, built from the current master branch. I'm processing a dataset of a few thousand objects on Google Cloud Storage, split across half a dozen directories. My code constructs an object (call it the Dataset object) that defines a distinct RDD for each directory. The constructor only defines the RDDs; it does not evaluate them, so I would expect it to return almost instantly. Indeed, the logging code in the constructor prints a line signaling its completion almost immediately after invocation, but the Spark shell does not show the prompt right away. Instead, it appears frozen for a few minutes, eventually producing the following output:
14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process : 9
14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process : 759
14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process : 228
14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process : 3076
14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process : 1013
14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process : 156

This stage is inexplicably slow. What could be happening?

Thanks.
Alex
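P.S. For concreteness, here is a minimal sketch of the pattern I described; the class, field, and GCS path names below are illustrative, not my actual code:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch only: one RDD defined per input directory, nothing evaluated.
class Dataset(sc: SparkContext, dirs: Seq[String]) {
  // Each textFile call should only *define* an RDD; no job runs here.
  val rdds: Map[String, RDD[String]] =
    dirs.map(d => d -> sc.textFile(d)).toMap
  println("Dataset constructed") // this prints almost immediately
}

// In spark-shell, sc is predefined.
val dirs = Seq("gs://my-bucket/dir1", "gs://my-bucket/dir2") // etc.
val ds = new Dataset(sc, dirs)
// ...yet the shell then stalls for minutes before showing the prompt.
```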