The data partitioning is done by default *according to the number of HDFS blocks* of the source. You can change the partitioning with .repartition, either to increase or decrease the level of parallelism:
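As a side note, here is a minimal self-contained sketch of this (assuming Spark in local mode with spark-core on the classpath; an in-memory RDD stands in for the sequence file) that checks the partition count before and after repartitioning:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    // Local mode with all available cores; in a real job this
    // would be the cluster's master URL instead.
    val conf = new SparkConf().setAppName("partition-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Placeholder data instead of sc.sequenceFile(...)
    val rdd = sc.parallelize(1 to 1000000)
    println(s"default partitions: ${rdd.partitions.length}")

    // Spread the data over e.g. 4 nodes * 32 cores = 128 partitions.
    val repartitioned = rdd.repartition(4 * 32)
    println(s"after repartition: ${repartitioned.partitions.length}")

    sc.stop()
  }
}
```

Note that repartition triggers a full shuffle of the data, which is usually an acceptable one-time cost when the source file is smaller than a single HDFS block.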
val recordsRDD = sc.sequenceFile[NullWritable, BytesWritable](FilePath, 256)
val recordsRDDInParallel = recordsRDD.repartition(4 * 32)
val infoRDD = recordsRDDInParallel.map(f => info_func())
val hdfs_RDD = infoRDD.reduceByKey(_ + _, 48) /* makes 48 partitions */
hdfs_RDD.saveAsNewAPIHadoopFile

André

On 2014-04-21 13:21, neeravsalaria wrote:
Hi,

I have been using MapReduce to analyze multiple files whose sizes range from 10 MB to 200 MB per file. Recently I planned to move to Spark, but my Spark job is taking too much time executing a single file when the file size is 10 MB and the HDFS block size is 64 MB. It executes on a single datanode and on a single core (my cluster is a 4-node setup, each node having 32 cores). Each file has 3 million rows, and I have to analyze each row (ignoring none) and create a set of info from it. Isn't there a way I can parallelize the processing of the file, either on other nodes or using the remaining cores of the same node?

Demo code:

val recordsRDD = sc.sequenceFile[NullWritable, BytesWritable](FilePath, 256) /* to parallelize */
val infoRDD = recordsRDD.map(f => info_func())
val hdfs_RDD = infoRDD.reduceByKey(_ + _, 48) /* makes 48 partitions */
hdfs_RDD.saveAsNewAPIHadoopFile

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-running-slow-for-small-hadoop-files-of-10-mb-size-tp4526.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
--
André Bois-Crettez
Software Architect
Big Data Developer
http://www.kelkoo.com/

Kelkoo SAS
A simplified joint-stock company (Société par Actions Simplifiée) with a capital of €4,168,964.30
Registered office: 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended exclusively for their addressees. If you are not the intended recipient of this message, please destroy it and notify the sender.