Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
Thanks Xiayun Sun and Robin East for your inputs. It makes sense to me. Thanks & Regards, Gokula Krishnan (Gokul)

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Xiayun Sun
I'm guessing by "part files" you mean files like part-r-0. These are actually different from the Hadoop "block size", which is the value actually used for partitioning. Looks like your HDFS block size is the default 128 MB: 258.2 GB in 500 part files -> around 528 MB per part file -> each part file spans about 4-5 blocks of 128 MB -> on the order of 2000-2500 input splits in total, which lines up with the 2290 tasks you saw.
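A quick back-of-the-envelope check of that arithmetic, using only the figures quoted in the thread (the exact split count depends on per-file rounding, so this is an estimate, not Spark's actual computation):

    // Figures from the thread: 258.2 GB total, 500 part files, 128 MB blocks.
    val totalMB      = 258.2 * 1024            // total input size in MB
    val numPartFiles = 500
    val blockMB      = 128.0                   // default HDFS block size

    val mbPerFile     = totalMB / numPartFiles // ~529 MB per part file
    val blocksPerFile = mbPerFile / blockMB    // ~4.1 blocks per file

    // Splits are computed per file, so rounding lands the total between
    // totalMB / blockMB (~2066) and 500 * 5 (2500) -- consistent with 2290.
    println(f"$mbPerFile%.0f MB/file, $blocksPerFile%.1f blocks/file")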

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Robin East
sc.textFile will use the Hadoop TextInputFormat (I believe), which uses the Hadoop block size to read records from HDFS. Most likely the block size is 128 MB. Not sure you can do anything about the number of tasks generated to read from HDFS.
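To make that concrete: sc.textFile goes through Hadoop's old-API FileInputFormat, whose split-size rule can be sketched roughly like this (simplified; the real code works per file and allows about 10% slop on the last split):

    // Sketch of FileInputFormat's split sizing (old mapred API, simplified).
    def splitSize(totalSize: Long, minPartitions: Int,
                  blockSize: Long, minSize: Long = 1L): Long = {
      val goalSize = totalSize / math.max(minPartitions, 1)
      math.max(minSize, math.min(goalSize, blockSize))
    }

With minPartitions = 500 the goal size (~528 MB) exceeds the 128 MB block size, so splits stay at 128 MB and the task count is driven by the block count; with minPartitions = 3000 the goal size drops to ~88 MB, below the block size, so the split count overshoots the hint.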

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
Excuse the too many mails on this post. Found a similar issue: https://stackoverflow.com/questions/24671755/how-to-partition-a-rdd Thanks & Regards, Gokula Krishnan (Gokul)
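The usual suggestion discussed there boils down to reshaping the RDD after the read when the goal is downstream parallelism rather than the number of read tasks (the path below is a placeholder, not from this thread):

    val rdd = sc.textFile("hdfs:///path/to/input")

    val narrowed   = rdd.coalesce(500)     // no shuffle; can only reduce partitions
    val reshuffled = rdd.repartition(500)  // full shuffle; can increase or decrease

    println(s"${narrowed.getNumPartitions} vs ${reshuffled.getNumPartitions}")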

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
In addition to that, I tried to read the same file with 3000 partitions, but it used 3070 partitions and took more time than the previous run; please refer to the attachment. Thanks & Regards, Gokula Krishnan (Gokul)
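A minimal way to reproduce that observation (hypothetical path; minPartitions is only a lower bound, so the actual count can exceed what you ask for):

    val rdd3000 = sc.textFile("hdfs:///path/to/input", minPartitions = 3000)
    println(rdd3000.getNumPartitions)  // 3070 in the run above, not 3000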

[Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
Hello All, I have an HDFS file with approx. *1.5 billion records* in 500 part files (258.2 GB in size), and when I tried to execute the following, I could see that it used 2290 tasks, but it was supposed to be 500, matching the HDFS part files, wasn't it?

val inputFile =
val inputRdd = sc.textFile(inputFile)
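A hypothetical reconstruction of this experiment, with a placeholder path and assuming minPartitions = 500 per the subject line:

    val inputFile = "hdfs:///path/to/input"
    val inputRdd  = sc.textFile(inputFile, minPartitions = 500)
    // minPartitions is a lower-bound hint: with 128 MB blocks the read still
    // produces ~2290 tasks, one per block, not 500.
    println(inputRdd.getNumPartitions)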