Thanks Xiayun Sun and Robin East for your inputs. It makes sense to me.
Thanks & Regards,
Gokula Krishnan (Gokul)
On Tue, Jul 25, 2017 at 9:55 AM, Xiayun Sun wrote:
I'm guessing by "part files" you mean files like part-r-0. These are
actually different from the Hadoop "block size", which is what actually
determines the number of partitions.
Looks like your HDFS block size is the default 128MB: 258.2GB across 500 part
files -> around 528MB per part file -> each part file spans roughly 4-5
blocks, which is where the ~2290 tasks come from.
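To spell out the arithmetic (a quick sketch of the estimate above, using the numbers from this thread; plain Scala, no Spark needed):

// Rough estimate of input splits for 258.2GB in 500 part files with
// the default 128MB HDFS block size.
val totalBytes    = 258.2 * 1024 * 1024 * 1024
val partFiles     = 500
val blockBytes    = 128.0 * 1024 * 1024
val bytesPerFile  = totalBytes / partFiles                     // ~528MB
val splitsPerFile = math.ceil(bytesPerFile / blockBytes).toInt // ~5
val totalSplits   = splitsPerFile * partFiles                  // ~2500; same ballpark
                                                               // as the observed 2290
                                                               // (real files vary in size)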
sc.textFile will use the Hadoop TextInputFormat (I believe); this will use the
Hadoop block size to read records from HDFS. Most likely the block size is
128MB. I'm not sure you can do anything about the number of tasks generated to
read from HDFS.
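One knob that is sometimes used for this (a sketch, not something verified in this thread; assumes a spark-shell where sc is defined, and the path is hypothetical):

// Ask FileInputFormat for larger splits so fewer read tasks are generated.
// 512MB here is an illustrative value, not a recommendation.
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.minsize",
  512L * 1024 * 1024)
val rdd = sc.textFile("/path/to/data") // hypothetical path
println(rdd.getNumPartitions)          // should drop well below 2290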
Excuse the many mails on this post.
I found a similar issue:
https://stackoverflow.com/questions/24671755/how-to-partition-a-rdd
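For anyone following along, a rough sketch of the two approaches discussed there, repartition versus coalesce (the path is hypothetical; assumes a spark-shell):

val rdd = sc.textFile("/path/to/data") // hypothetical path
// repartition does a full shuffle and returns exactly the requested
// number of partitions (works for growing or shrinking the count).
val exact = rdd.repartition(500)
// coalesce merges existing partitions without a shuffle, so it can
// only shrink the partition count (cheaper than repartition).
val fewer = rdd.coalesce(500)
println(exact.getNumPartitions) // 500
println(fewer.getNumPartitions) // <= 500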
Thanks & Regards,
Gokula Krishnan (Gokul)
On Tue, Jul 25, 2017 at 8:21 AM, Gokula Krishnan D wrote:
In addition to that,
I tried to read the same file with 3000 partitions, but it used 3070
partitions and took more time than before; please refer to the attachment.
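This matches minPartitions being only a lower-bound hint; a minimal sketch (hypothetical path, assumes a spark-shell where sc is defined):

// The second argument to textFile is a *minimum*, not an exact count:
// FileInputFormat may still emit more splits than requested.
val hinted = sc.textFile("/path/to/data", minPartitions = 3000)
println(hinted.getNumPartitions) // can come back higher, e.g. 3070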
Thanks & Regards,
Gokula Krishnan (Gokul)
On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D wrote:
Hello All,
I have an HDFS file with approx. 1.5 billion records in 500 part files
(258.2GB total), and when I tried to execute the following, I could see that
it used 2290 tasks. But shouldn't it be 500, one per HDFS part file?
val inputFile = // (input path elided in the original mail)
val inputRdd = sc.textFile(inputFile)
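The task count can be confirmed from the RDD itself (a one-line check, using the inputRdd defined above):

println(inputRdd.getNumPartitions) // reports 2290 here rather than 500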