Hi, 

  I have been using MapReduce to analyze multiple files whose sizes range
from 10 MB to 200 MB per file. Recently I planned to move to Spark, but my
Spark job takes too much time executing a single file when the file size is
10 MB and the HDFS block size is 64 MB. It executes on a single datanode and
on a single core (my cluster is a 4-node setup, each node having 32 cores).
Each file has 3 million rows, and I have to analyze every row (ignoring none)
and create a set of info from it.

Isn't there a way I can parallelize the processing of the file, either
across other nodes or by using the remaining cores of the same node?

Demo code:

     val recordsRDD =
       sc.sequenceFile[NullWritable, BytesWritable](FilePath, 256)  /* 256 = minPartitions hint, to parallelize */

     val infoRDD = recordsRDD.map(f => info_func(f))  /* info_func returns a (key, value) pair */

     val hdfs_RDD = infoRDD.reduceByKey(_ + _, 48)  /* makes 48 partitions */

     hdfs_RDD.saveAsNewAPIHadoopFile(...)  /* output path and format classes omitted */
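
Is something like the following what I should be doing? This is only a sketch
of what I had in mind, not working code: the input/output paths, the 128
partition count (4 nodes * 32 cores) and the string-based "info" key are
made-up placeholders for my real analysis. The idea is to copy each record's
bytes out of the BytesWritable (Writables are not Java-serializable, so that
has to happen before any shuffle) and then repartition, so the expensive
per-row work runs on many cores instead of one.

     import java.util.Arrays
     import org.apache.hadoop.io.{BytesWritable, NullWritable}
     import org.apache.spark.{SparkConf, SparkContext}
     import org.apache.spark.SparkContext._  // pair-RDD implicits on older Spark releases

     object SmallFileSketch {
       def main(args: Array[String]): Unit = {
         val sc = new SparkContext(new SparkConf().setAppName("small-file-analysis"))

         // A 10 MB file fits in one 64 MB HDFS block, so it tends to come back
         // as a single partition and therefore runs on a single core.
         val recordsRDD =
           sc.sequenceFile[NullWritable, BytesWritable]("/data/input/part-00000.seq")

         // Copy the payload into a plain Array[Byte] (serializable), then
         // repartition so the per-row analysis is spread across cores/nodes.
         val spreadRDD = recordsRDD
           .map { case (_, bw) => Arrays.copyOf(bw.getBytes, bw.getLength) }
           .repartition(128)

         // Placeholder for the real per-row analysis: derive an "info" key
         // from each record and count occurrences.
         val infoRDD = spreadRDD.map(bytes => (new String(bytes, "UTF-8").trim, 1L))

         val counts = infoRDD.reduceByKey(_ + _, 48)
         counts.saveAsTextFile("/data/output/info-counts")

         sc.stop()
       }
     }

Would a repartition() like this actually spread the work, or is there a better
way to get Spark to split a file that is smaller than one HDFS block?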



