Oh, it makes sense that gsutil scans through this quickly, but I was wondering whether running a Hadoop job / bdutil would result in scans that are just as fast?
On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta <alexbare...@gmail.com> wrote:

> Denny,
>
> No, gsutil scans through the listing of the bucket quickly. See the
> following.
>
> alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"
>
> 6860
>
> real 0m6.971s
> user 0m1.052s
> sys 0m0.096s
>
> Alex
>
>
> On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee <denny.g....@gmail.com> wrote:
>>
>> I'm curious if you're seeing the same thing when using bdutil against
>> GCS? I'm wondering if this may be an issue concerning the transfer rate of
>> Spark -> Hadoop -> GCS Connector -> GCS.
>>
>>
>> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <
>> alexbare...@gmail.com> wrote:
>>
>>> All,
>>>
>>> I'm using the Spark shell to interact with a small test deployment of
>>> Spark, built from the current master branch. I'm processing a dataset
>>> comprising a few thousand objects on Google Cloud Storage, split into a
>>> half dozen directories. My code constructs an object--let me call it the
>>> Dataset object--that defines a distinct RDD for each directory. The
>>> constructor of the object only defines the RDDs; it does not actually
>>> evaluate them, so I would expect it to return very quickly. Indeed, the
>>> logging code in the constructor prints a line signaling the completion of
>>> the code almost immediately after invocation, but the Spark shell does not
>>> show the prompt right away. Instead, it spends a few minutes seemingly
>>> frozen, eventually producing the following output:
>>>
>>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process : 9
>>>
>>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process : 759
>>>
>>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process : 228
>>>
>>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process : 3076
>>>
>>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process : 1013
>>>
>>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process : 156
>>>
>>> This stage is inexplicably slow. What could be happening?
>>>
>>> Thanks.
>>>
>>>
>>> Alex
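
The timestamps in the quoted log can be turned into a rough per-path cost. What follows is only a sketch in Python, and it rests on an assumption the log itself does not confirm: that the time between two consecutive FileInputFormat lines was spent listing the paths reported by the later line. The names LOG and interval_costs are made up for this illustration; the timestamps and path counts are copied from the log above.

```python
from datetime import datetime

# (timestamp, "Total input paths to process") pairs copied from the quoted log.
LOG = [
    ("14/12/18 05:52:49", 9),
    ("14/12/18 05:54:15", 759),
    ("14/12/18 05:54:40", 228),
    ("14/12/18 06:00:11", 3076),
    ("14/12/18 06:02:02", 1013),
    ("14/12/18 06:02:21", 156),
]


def interval_costs(entries):
    """Return (paths, seconds, seconds_per_path) for every entry after the first.

    Assumption: the gap before a log line is attributed entirely to listing
    the paths that line reports.
    """
    fmt = "%y/%m/%d %H:%M:%S"  # yy/MM/dd HH:mm:ss, as printed in the log
    times = [datetime.strptime(ts, fmt) for ts, _ in entries]
    rows = []
    for prev, cur, (_, paths) in zip(times, times[1:], entries[1:]):
        seconds = (cur - prev).total_seconds()
        rows.append((paths, seconds, seconds / paths))
    return rows


for paths, seconds, per_path in interval_costs(LOG):
    print(f"{paths:5d} paths  {seconds:5.0f} s  {per_path:.3f} s/path")
```

Under that assumption the five intervals come out at roughly 0.11 to 0.12 seconds per path regardless of directory size, whereas the gsutil run quoted above listed 6860 objects in about 7 seconds. A near-constant per-path figure would be consistent with a fixed cost paid once per object during listing rather than with a data transfer bottleneck, but the log alone cannot confirm where the time goes.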