Finding the file info when loading data into an RDD

2014-12-21 Thread Shuai Zheng
Hi All,

When I load a folder into an RDD, is there any way to find the input
file name for a particular partition? I want to track which file each
partition came from.

In Hadoop, I can find this information with the following code:

FileSplit fileSplit = (FileSplit) context.getInputSplit();
String strFilename = fileSplit.getPath().getName();

But how can I do this in Spark?

Regards,

Shuai


Re: Finding the file info when loading data into an RDD

2014-12-21 Thread Shuai Zheng
I just found a possible answer:

http://themodernlife.github.io/scala/spark/hadoop/hdfs/2014/09/28/spark-input-filename/

I will give it a try. It is a bit cumbersome, but if it works, it will
give me what I want.

Sorry to bother everyone here.

Regards,

Shuai
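[Editor's note: the linked post's technique can be sketched roughly as follows. This is a minimal sketch, not the post's exact code; it assumes Spark 1.x, a hypothetical input path `/data/folder`, and that `sc.hadoopFile` returns a `HadoopRDD` under the hood (which is an implementation detail, hence the cast).]

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.HadoopRDD

object InputFileNames {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("input-file-names"))

    // sc.hadoopFile is backed by a HadoopRDD; cast down to reach
    // mapPartitionsWithInputSplit (tagged @DeveloperApi, see below).
    val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat]("/data/folder")
      .asInstanceOf[HadoopRDD[LongWritable, Text]]

    // Each partition of a HadoopRDD corresponds to one InputSplit,
    // so the FileSplit cast from the Hadoop snippet works unchanged.
    val linesWithFile = rdd.mapPartitionsWithInputSplit {
      (split: InputSplit, iter: Iterator[(LongWritable, Text)]) =>
        val fileName = split.asInstanceOf[FileSplit].getPath.getName
        // Call toString eagerly: Hadoop reuses Text objects across records.
        iter.map { case (_, line) => (fileName, line.toString) }
    }

    linesWithFile.take(5).foreach(println)
    sc.stop()
  }
}
```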

On Sun, Dec 21, 2014 at 4:43 PM, Shuai Zheng szheng.c...@gmail.com wrote:

 Hi All,

 When I load a folder into an RDD, is there any way to find the input
 file name for a particular partition? I want to track which file each
 partition came from.

 In Hadoop, I can find this information with the following code:

 FileSplit fileSplit = (FileSplit) context.getInputSplit();
 String strFilename = fileSplit.getPath().getName();

 But how can I do this in Spark?

 Regards,

 Shuai



Re: Finding the file info when loading data into an RDD

2014-12-21 Thread Anwar Rizal
Yeah... but apparently mapPartitionsWithInputSplit is tagged as
DeveloperApi. Because of that, I'm not sure it's a good idea to use
the function.

For this problem, I had to create a subclass of HadoopRDD and use
mapPartitions instead.

Is there any reason why mapPartitionsWithInputSplit has the DeveloperApi
annotation? Is it possible to remove it?

Best regards,
Anwar Rizal.
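[Editor's note: a different technique that stays on the stable API surface, for the common case of a folder of many small files, is `SparkContext.wholeTextFiles`, which keys every record by its file path. This is not the subclassing approach described above; it is a sketch of an alternative, assuming the files fit comfortably in memory per record and a hypothetical path `/data/folder`.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WholeFilesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("whole-files"))

    // wholeTextFiles yields (filePath, fileContent) pairs, so the
    // originating file travels with the data without any DeveloperApi.
    val files = sc.wholeTextFiles("/data/folder")

    files
      .map { case (path, content) => (path, content.split("\n").length) }
      .collect()
      .foreach { case (path, n) => println(s"$path: $n lines") }

    sc.stop()
  }
}
```

The trade-off is that each file is read as a single record, so this suits many small files rather than a few large, splittable ones.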

On Sun, Dec 21, 2014 at 10:47 PM, Shuai Zheng szheng.c...@gmail.com wrote:

 I just found a possible answer:


 http://themodernlife.github.io/scala/spark/hadoop/hdfs/2014/09/28/spark-input-filename/

 I will give it a try. It is a bit cumbersome, but if it works, it will
 give me what I want.

 Sorry to bother everyone here.

 Regards,

 Shuai

 On Sun, Dec 21, 2014 at 4:43 PM, Shuai Zheng szheng.c...@gmail.com
 wrote:

 Hi All,

 When I load a folder into an RDD, is there any way to find the input
 file name for a particular partition? I want to track which file each
 partition came from.

 In Hadoop, I can find this information with the following code:

 FileSplit fileSplit = (FileSplit) context.getInputSplit();
 String strFilename = fileSplit.getPath().getName();

 But how can I do this in Spark?

 Regards,

 Shuai