Hi Anchit,
can you create more than one file in each dataset to test again?
> On Sep 26, 2015, at 18:00, Fengdong Yu wrote:
>
> Anchit,
>
> please ignore my inputs. you are right. Thanks.
>
>
>
>> On Sep 26, 2015, at 17:27, Fengdong Yu
Shouldn't this discussion be held on the user list and not the dev list?
The dev list (this list) is for discussing development on Spark itself.
Please move the discussion accordingly.
Nick
On Sun, Sep 27, 2015 at 10:57 PM, Fengdong Yu wrote:
> Hi Anchit,
> can you create
Hi Anchit,
this is not what I expected, because you specified the HDFS directory in your code.
I've solved like this:
val text = sc.hadoopFile(Args.input,
  classOf[TextInputFormat], classOf[LongWritable],
  classOf[Text], 2)
val hadoopRdd =
Anchit,
please ignore my inputs. you are right. Thanks.
> On Sep 26, 2015, at 17:27, Fengdong Yu wrote:
>
> Hi Anchit,
>
> this is not what I expected, because you specified the HDFS directory in
> your code.
> I've solved like this:
>
> val text =
Hi Fengdong,
Thanks for your question.
Spark already has a function called wholeTextFiles within sparkContext
which can help you with that:
Python
hdfs://a-hdfs-path/part-0
hdfs://a-hdfs-path/part-1
...
hdfs://a-hdfs-path/part-n
rdd =
Hi Anchit,
Thanks for the quick answer.
my exact question is: I want to add the HDFS location to each line in my JSON
data.
> On Sep 25, 2015, at 11:25, Anchit Choudhry wrote:
>
> Hi Fengdong,
>
> Thanks for your question.
>
> Spark already has a function
Hi Fengdong,
So I created two files in HDFS under a test folder.
test/dt=20100101.json
{ "key1" : "value1" }
test/dt=20100102.json
{ "key2" : "value2" }
Then inside PySpark shell
rdd = sc.wholeTextFiles('./test/*')
rdd.collect()
[(u'hdfs://localhost:9000/user/hduser/test/dt=20100101.json',
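Since wholeTextFiles returns (path, content) pairs like the one above, the remaining step is the per-pair transformation that attaches the path as a "source" field. A minimal pure-Python sketch of that step (outside Spark, so it runs anywhere; the paths and `annotate` helper are illustrative, not Spark API):

```python
import json

# Simulated output of wholeTextFiles: (file path, file content) pairs,
# mirroring the two test files above.
pairs = [
    ("hdfs://localhost:9000/user/hduser/test/dt=20100101.json",
     '{ "key1" : "value1" }'),
    ("hdfs://localhost:9000/user/hduser/test/dt=20100102.json",
     '{ "key2" : "value2" }'),
]

def annotate(path, content):
    """Parse one JSON record and attach its source path."""
    record = json.loads(content)
    record["source"] = path
    return record

annotated = [annotate(path, content) for path, content in pairs]
```

Inside PySpark the same function would presumably be applied with something like `rdd.map(lambda pc: annotate(*pc))` on the wholeTextFiles RDD.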
Hi,
I have multiple files with JSON format, such as:
/data/test1_data/sub100/test.data
/data/test2_data/sub200/test.data
I can sc.textFile("/data/*/*"),
but I want to add {"source" : "HDFS_LOCATION"} to each line, then save it
to one target HDFS location.
How can I do it? Thanks.
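A minimal local sketch of the whole operation (read lines from several locations, tag each line with where it came from, collect everything for one output), assuming plain in-memory input; the `files` dict and `tag_lines` helper are hypothetical stand-ins for what Spark would provide from the input splits:

```python
import json

# Hypothetical input: the JSON lines of each file, keyed by its location.
# In Spark the location would come from the input split; here it is given.
files = {
    "/data/test1_data/sub100/test.data": ['{"a": 1}', '{"a": 2}'],
    "/data/test2_data/sub200/test.data": ['{"b": 3}'],
}

def tag_lines(files):
    """Yield each JSON line with a {"source": location} field added."""
    for location, lines in files.items():
        for line in lines:
            record = json.loads(line)
            record["source"] = location
            yield json.dumps(record)

# All tagged lines together, as they would be saved to one target location.
output = list(tag_lines(files))
```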
yes. For example, I have two data sets:
data set A: /data/test1/dt=20100101
data set B: /data/test2/dt=20100202
all data has the same JSON format, such as:
{"key1" : "value1", "key2" : "value2"}
my expected output:
{"key1" : "value1", "key2" : "value2", "source" : "test1", "date" : "20100101"}
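The "source" and "date" fields in that expected output can both be recovered from the input path. A small sketch of that extraction and merge step, assuming paths follow the /data/<source>/dt=<date> layout above (the regex and `enrich` helper are assumptions, not part of Spark):

```python
import json
import re

# Assumed path layout: /data/<source>/dt=<yyyymmdd>
PATH_RE = re.compile(r"/data/(?P<source>[^/]+)/dt=(?P<date>\d{8})")

def enrich(path, line):
    """Merge {"source", "date"} parsed from the path into one JSON line."""
    meta = PATH_RE.search(path).groupdict()
    record = json.loads(line)
    record.update(meta)
    return json.dumps(record)

enriched = enrich("/data/test1/dt=20100101",
                  '{"key1": "value1", "key2": "value2"}')
```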