Thanks. I suspected as much, but querying a df inside a map seemed so
intuitive that I didn't want to give up on it right away.

I tried a join and, even better, wrapped it in DStream.transform(), and it works!
freqs = testips.transform(lambda rdd: rdd.join(kvrdd).map(lambda kv: kv[1][1]))
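
For anyone landing here later, a minimal sketch of the same join on a plain RDD, assuming the IPs arrive as bare strings (as in my original question) and that kvrdd holds (ip, freq) pairs built from the df. The helper names key_by_ip, pick_freq, freqs_by_join and freqs_for_stream are made up for illustration; only map(), join() and transform() are Spark calls.

```python
def key_by_ip(ip):
    # join() works on (key, value) pairs, so pair each IP with a placeholder.
    return (ip, None)


def pick_freq(joined):
    # After the join each element looks like (ip, (placeholder, freq)).
    return joined[1][1]


def freqs_by_join(ips_rdd, kvrdd):
    # Inner join: IPs that are missing from kvrdd are dropped from the result.
    return ips_rdd.map(key_by_ip).join(kvrdd).map(pick_freq)


def freqs_for_stream(testips, kvrdd):
    # Apply the same join to every micro-batch of the DStream.
    return testips.transform(lambda rdd: freqs_by_join(rdd, kvrdd))


# Possible usage, assuming df has columns named ip and freq:
#   kvrdd = df.rdd.map(lambda row: (row.ip, row.freq))
#   rdd = sc.parallelize(['208.51.22.18', '31.207.6.173', '208.51.22.18'])
#   freqs = freqs_by_join(rdd, kvrdd)
```

If the IPs that are not in the table should be kept, leftOuterJoin() in place of join() yields None for them instead of dropping them.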

Thank you for the help!
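
I did not end up using the broadcast route suggested below, but a rough sketch might look like this, assuming the frequency table is small enough to collect on the driver and has columns named ip and freq. build_table and freqs_by_broadcast are hypothetical helper names; sc.broadcast() and the .value attribute are the Spark pieces.

```python
def build_table(rows):
    # rows: objects with .ip and .freq attributes, e.g. the result of df.collect().
    return {row.ip: row.freq for row in rows}


def freqs_by_broadcast(sc, ips_rdd, rows):
    # Ship the small lookup dict to the executors once, then query it in the map.
    table = sc.broadcast(build_table(rows))
    # dict.get() returns None for IPs that are not in the table.
    return ips_rdd.map(lambda ip: table.value.get(ip))


# Possible usage:
#   rdd = sc.parallelize(['208.51.22.18', '31.207.6.173', '208.51.22.18'])
#   freqs = freqs_by_broadcast(sc, rdd, df.collect())
```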

Ping

On Thu, May 21, 2015 at 10:40 AM, Holden Karau <hol...@pigscanfly.ca> wrote:

> So DataFrames, like RDDs, can only be accessed from the driver. If your IP
> Frequency table is small enough you could collect it and distribute it as a
> hashmap with broadcast or you could also join your rdd with the ip
> frequency table. Hope that helps :)
>
>
> On Thursday, May 21, 2015, ping yan <sharon...@gmail.com> wrote:
>
>> I have a dataframe as a reference table for IP frequencies.
>> e.g.,
>>
>> ip                 freq
>> 10.226.93.67       1
>> 10.226.93.69       1
>> 161.168.251.101    4
>> 10.236.70.2        1
>> 161.168.251.105    14
>>
>>
>> All I need is to query the df in a map.
>>
>> rdd = sc.parallelize(['208.51.22.18', '31.207.6.173', '208.51.22.18'])
>>
>> freqs = rdd.map(lambda x: df.where(df.ip == x).first())
>>
>> It doesn't go through. I would appreciate any help.
>>
>> Thanks!
>> Ping
>>
>>
>>
>>
>> --
>> Ping Yan
>> Ph.D. in Management
>> Dept. of Management Information Systems
>> University of Arizona
>> Tucson, AZ 85721
>>
>>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
> Linked In: https://www.linkedin.com/in/holdenkarau
>
>


-- 
Ping Yan
Ph.D. in Management
Dept. of Management Information Systems
University of Arizona
Tucson, AZ 85721
