Hi Romi,
Yes, you understand it correctly. The keys of rdd1 overlap with the keys of 
rdd2: many keys appear in both, some keys are in rdd1 but not in rdd2, and 
some keys are in rdd2 but not in rdd1. The keys of rdd3 would be the same as 
the keys of rdd1, so rdd3 will not include keys that are in rdd2 but not in 
rdd1. The values of rdd3 come from rdd2; if a key in rdd3 is not in rdd2, its 
value would NOT exist.
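
The rule described above matches left outer join semantics. Below is a minimal 
sketch in plain Python (not Spark itself) that models it on small in-memory 
collections. The sample dates and values are made up for illustration, the 
helper name left_outer_values is hypothetical, and rdd2 is assumed to have 
unique keys. In Spark, the same result would presumably come from keying rdd1 
as pairs, calling leftOuterJoin with rdd2, and keeping only the rdd2 side of 
each joined value.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def left_outer_values(
    keys: Iterable[str], pairs: Dict[str, float]
) -> List[Tuple[str, Optional[float]]]:
    """Keep every key from `keys`; take its value from `pairs`, or None if absent."""
    return [(k, pairs.get(k)) for k in keys]


# Hypothetical sample data standing in for rdd1 (keys only) and rdd2 (key -> float).
rdd1_keys = ["2015-09-01", "2015-09-02", "2015-09-03"]
rdd2_pairs = {"2015-09-02": 1.5, "2015-09-03": 2.5, "2015-09-04": 3.5}

# rdd3 keeps exactly the rdd1 keys; the key that exists only in rdd2 is dropped.
rdd3_rows = left_outer_values(rdd1_keys, rdd2_pairs)
for row in rdd3_rows:
    print(row)
```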

You are always so good with Spark and always have a solution to the 
questions. I really appreciate it.
Thank you very much~
Zhiliang  


     On Tuesday, September 22, 2015 4:08 AM, Romi Kuntsman <r...@totango.com> 
wrote:
   

 Hi,
If I understand correctly:
rdd1 contains keys (of type StringDate)
rdd2 contains keys and values
and rdd3 contains all the keys, and the values from rdd2?

I think you should make rdd1 and rdd2 PairRDD, and then use outer join.
Does that make sense?
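
As a rough illustration of this suggestion, here is a plain-Python model (not 
Spark code) of a full outer join between a keyed rdd1 and rdd2. The sample data 
and the helper name full_outer_join are hypothetical. It mimics the 
(key, (left, right)) shape that a full outer join on pair RDDs produces, with 
None standing in for a missing side, and assumes unique keys on each side.

```python
from typing import Dict, List, Optional, Tuple

Joined = Tuple[str, Tuple[Optional[bool], Optional[float]]]


def full_outer_join(left: Dict[str, bool], right: Dict[str, float]) -> List[Joined]:
    """Model of a full outer join on unique keys: every key from either side appears once."""
    all_keys = sorted(set(left) | set(right))
    return [(k, (left.get(k), right.get(k))) for k in all_keys]


# Hypothetical sample data: rdd1 keyed with a placeholder value, rdd2 as key -> float.
rdd1_pairs = {"2015-09-01": True, "2015-09-02": True}
rdd2_pairs = {"2015-09-02": 1.5, "2015-09-04": 3.5}

joined = full_outer_join(rdd1_pairs, rdd2_pairs)

# Keys that exist only in rdd2 show up with a missing left side; dropping them
# leaves exactly the rdd1 keys, each with its rdd2 value or None.
only_rdd1_keys = [
    (key, right_val) for key, (left_val, right_val) in joined if left_val is not None
]
print(joined)
print(only_rdd1_keys)
```

The full outer join keeps the union of the keys, so an extra filtering step is 
needed when only the keys of rdd1 should survive.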

On Mon, Sep 21, 2015 at 8:37 PM Zhiliang Zhu <zchl.j...@yahoo.com> wrote:

Dear Romi, Priya, Sujt and Shivaram and all,
I have spent many days thinking about this issue, but without finding a good 
enough solution... I would appreciate all your kind help.
There is an RDD<StringDate> rdd1, and another RDD<StringDate, float> rdd2 
(rdd2 can be a PairRDD, or a DataFrame with two columns <StringDate, float>). 
The StringDate column values of rdd1 and rdd2 overlap but are not the same.

I would like to get a new RDD<StringDate, float> rdd3. The StringDate values 
in rdd3 would all be the same as in rdd1, and the float in rdd3 would come 
from rdd2 if its StringDate is in rdd2, or else NULL would be assigned.
Each row is rdd3[ i ] = <rdd1[ i ].StringDate, rdd2[ i ].float or NULL>: 
where rdd2[ i ].StringDate is the same as rdd1[ i ].StringDate, 
rdd2[ i ].float is assigned to the rdd3[ i ] entry for that StringDate. 
What kind of API or function should I use?
Thanks very much!
Zhiliang




  
