Re: How to get the HDFS path for each RDD

2015-09-27 Thread Fengdong Yu
Hi Anchit, can you create more than one data file in each dataset and test again?

> On Sep 26, 2015, at 18:00, Fengdong Yu wrote:
>
> Anchit,
>
> please ignore my inputs. You are right. Thanks.
>
>> On Sep 26, 2015, at 17:27, Fengdong Yu

Re: How to get the HDFS path for each RDD

2015-09-27 Thread Nicholas Chammas
Shouldn't this discussion be held on the user list and not the dev list? The dev list (this list) is for discussing development on Spark itself. Please move the discussion accordingly. Nick

On Sun, Sep 27, 2015 at 10:57 PM, Fengdong Yu wrote:

> Hi Anchit,
> can you create

Re: How to get the HDFS path for each RDD

2015-09-26 Thread Fengdong Yu
Hi Anchit, this is not what I expected, because you specified the HDFS directory in your code. I've solved it like this:

    val text = sc.hadoopFile(Args.input,
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 2)
    val hadoopRdd =
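The snippet above breaks off at val hadoopRdd =. A minimal sketch of how this pattern is usually completed (assuming Spark 1.x; Args.input is the poster's own input path, and the cast relies on sc.hadoopFile returning a HadoopRDD):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
    import org.apache.spark.rdd.HadoopRDD

    val text = sc.hadoopFile(Args.input,
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 2)

    // the cast exposes mapPartitionsWithInputSplit, which hands each
    // partition its InputSplit alongside the usual record iterator
    val hadoopRdd = text.asInstanceOf[HadoopRDD[LongWritable, Text]]

    val linesWithPath = hadoopRdd.mapPartitionsWithInputSplit(
      (split, iter) => {
        // a TextInputFormat split is a FileSplit, which knows its HDFS path
        val path = split.asInstanceOf[FileSplit].getPath.toString
        iter.map { case (_, line) => (path, line.toString) }
      },
      preservesPartitioning = true)

This yields (path, line) pairs, so every line keeps the HDFS location of the file it came from.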

Re: How to get the HDFS path for each RDD

2015-09-26 Thread Fengdong Yu
Anchit, please ignore my inputs. You are right. Thanks.

> On Sep 26, 2015, at 17:27, Fengdong Yu wrote:
>
> Hi Anchit,
>
> this is not what I expected, because you specified the HDFS directory in your code.
> I've solved it like this:
>
> val text =

Re: How to get the HDFS path for each RDD

2015-09-24 Thread Anchit Choudhry
Hi Fengdong, Thanks for your question. Spark already has a function called wholeTextFiles within SparkContext which can help you with that:

Python

    hdfs://a-hdfs-path/part-0
    hdfs://a-hdfs-path/part-1
    ...
    hdfs://a-hdfs-path/part-n

    rdd =
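The quoted example is cut off at rdd =. For reference, a sketch of the same call in Scala (the language of the original question); the paths are the hypothetical ones from the excerpt above:

    // wholeTextFiles returns an RDD of (path, fileContent) pairs, so the
    // HDFS location of every file travels together with its contents
    val rdd = sc.wholeTextFiles("hdfs://a-hdfs-path")
    rdd.keys.collect().foreach(println)
    // hdfs://a-hdfs-path/part-0
    // hdfs://a-hdfs-path/part-1
    // ...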

Re: How to get the HDFS path for each RDD

2015-09-24 Thread Fengdong Yu
Hi Anchit, Thanks for the quick answer. My exact question is: I want to add the HDFS location to each line of my JSON data.

> On Sep 25, 2015, at 11:25, Anchit Choudhry wrote:
>
> Hi Fengdong,
>
> Thanks for your question.
>
> Spark already has a function
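A minimal sketch of what is being asked for here, building on the wholeTextFiles suggestion (assumptions: each file holds one flat JSON object per line, a naive string splice stands in for a proper JSON library such as json4s, and the output path is hypothetical):

    val withSource = sc.wholeTextFiles("/data/*/*")
      .flatMap { case (path, content) =>
        content.split("\n").filter(_.trim.nonEmpty).map { line =>
          // splice a "source" field carrying the file's HDFS path
          // into each JSON object
          line.trim.stripSuffix("}") + s""", "source" : "$path" }"""
        }
      }
    withSource.saveAsTextFile("/data/output")  // hypothetical target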

Re: How to get the HDFS path for each RDD

2015-09-24 Thread Anchit Choudhry
Hi Fengdong, So I created two files in HDFS under a test folder.

    test/dt=20100101.json
    { "key1" : "value1" }

    test/dt=20100102.json
    { "key2" : "value2" }

Then inside the PySpark shell:

    rdd = sc.wholeTextFiles('./test/*')
    rdd.collect()
    [(u'hdfs://localhost:9000/user/hduser/test/dt=20100101.json',

How to get the HDFS path for each RDD

2015-09-24 Thread Fengdong Yu
Hi, I have multiple files in JSON format, such as:

    /data/test1_data/sub100/test.data
    /data/test2_data/sub200/test.data

I can sc.textFile("/data/*/*"), but I want to add {"source" : "HDFS_LOCATION"} to each line, then save it to one target HDFS location. How can I do it? Thanks.

Re: How to get the HDFS path for each RDD

2015-09-24 Thread Fengdong Yu
Yes. For example, I have two data sets:

    data set A: /data/test1/dt=20100101
    data set B: /data/test2/dt=20100202

All data has the same JSON format, such as:

    {"key1" : "value1", "key2" : "value2"}

My expected output:

    {"key1" : "value1", "key2" : "value2", "source" : "test1", "date" : "20100101"}
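A minimal sketch producing that expected output (assumptions: paths follow the /data/<source>/dt=<date> layout shown above, each file holds one JSON object per line, and the string splice again stands in for a real JSON library; the target location is hypothetical):

    // capture <source> and <date> from paths like /data/test1/dt=20100101
    val PathPattern = """.*/data/([^/]+)/dt=(\d{8}).*""".r

    val tagged = sc.wholeTextFiles("/data/*/dt=*")
      .flatMap { case (path, content) =>
        path match {
          case PathPattern(source, date) =>
            content.split("\n").iterator.filter(_.trim.nonEmpty).map { line =>
              line.trim.stripSuffix("}") +
                s""", "source" : "$source", "date" : "$date" }"""
            }
          case _ => Iterator.empty
        }
      }
    tagged.saveAsTextFile("/data/merged")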