Confused about shuffle read and shuffle write

2015-01-21 Thread Darin McBeath
I have the following code in a Spark job.

// Get the baseline input file(s)
JavaPairRDD<Text, Text> hsfBaselinePairRDDReadable =
    sc.hadoopFile(baselineInputBucketFile, SequenceFileInputFormat.class,
                  Text.class, Text.class);

JavaPairRDD<String, String> hsfBaselinePairRDD =
    hsfBaselinePairRDDReadable
        .mapToPair(new ConvertFromWritableTypes())
        .partitionBy(new HashPartitioner(Variables.NUM_OUTPUT_PARTITIONS))
        .persist(StorageLevel.MEMORY_ONLY_SER());

// Use 'substring' to extract epoch values.
JavaPairRDD<String, Long> baselinePairRDD =
    hsfBaselinePairRDD
        .mapValues(new ExtractEpoch(accumBadBaselineRecords))
        .persist(StorageLevel.MEMORY_ONLY_SER());

When looking at the STAGE information for my job, I notice the following:

Constructing hsfBaselinePairRDD takes about 7.5 minutes, with 931GB of input 
(read from S3) and 377GB of shuffle write (presumably because of the hash 
partitioning).  This all makes sense.
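As a sanity check (just a hypothetical snippet, not part of the job above), I could print the partitioner of each RDD via the underlying rdd(); partitionBy should be the only step that introduces the HashPartitioner, and mapValues should simply preserve it:

// Hypothetical check, not in the original job:
// expect Some(HashPartitioner) for both, since mapValues preserves partitioning.
System.out.println(hsfBaselinePairRDD.rdd().partitioner());
System.out.println(baselinePairRDD.rdd().partitioner());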

Constructing baselinePairRDD also takes about 7.5 minutes.  I thought that was 
a bit odd.  But what I thought was really odd is that there was also 330GB of 
shuffle read in this stage.  I would have expected 0 shuffle read in this 
stage.

What I'm confused about is why there is any 'shuffle read' at all when 
constructing the baselinePairRDD.  If anyone could shed some light on this, it 
would be appreciated.
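In case it helps, here is the kind of lineage dump I can attach (again, just a hypothetical debugging snippet, not part of the job itself); toDebugString() should show the ShuffledRDD introduced by partitionBy and where the stage boundary falls:

// Hypothetical debugging aid, not in the original job:
// prints the RDD lineage, including the ShuffledRDD from partitionBy.
System.out.println(baselinePairRDD.toDebugString());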

Thanks.

Darin.
