DataFrames, like RDDs, can only be accessed from the driver, which is why referencing df inside the function you pass to rdd.map fails. If your IP frequency table is small enough, you could collect it and distribute it as a hashmap with broadcast, or you could join your RDD with the IP frequency table. Hope that helps :)
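
Here is a rough sketch of both options in PySpark. It assumes the DataFrame columns are named ip and freq as in your example, and that sc, df and rdd are the objects from your snippet. The function names are made up for illustration, and I haven't run this against your data.

```python
def freqs_via_broadcast(sc, df, rdd):
    """Option 1: collect the small table on the driver and broadcast it as a dict."""
    # collect() runs on the driver, where the DataFrame is usable.
    ip_to_freq = {row.ip: row.freq for row in df.collect()}
    lookup = sc.broadcast(ip_to_freq)
    # The lambda captures only the broadcast handle, never the DataFrame itself.
    # IPs that are not in the table come back with a frequency of None.
    return rdd.map(lambda ip: (ip, lookup.value.get(ip)))


def freqs_via_join(df, rdd):
    """Option 2: join the RDD with the table, for tables too large to collect."""
    # Turn the DataFrame into an RDD of (ip, freq) pairs.
    table = df.rdd.map(lambda row: (row.ip, row.freq))
    # A join needs key/value pairs, so attach a placeholder value to each IP.
    keyed = rdd.map(lambda ip: (ip, None))
    # leftOuterJoin keeps IPs missing from the table; their freq is None.
    return keyed.leftOuterJoin(table).map(lambda kv: (kv[0], kv[1][1]))
```

Either way you get an RDD of (ip, freq) pairs, e.g. freqs_via_broadcast(sc, df, rdd).collect(). The broadcast version avoids a shuffle but only makes sense if the whole table fits comfortably in memory on each executor.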

On Thursday, May 21, 2015, ping yan <sharon...@gmail.com> wrote:

> I have a dataframe as a reference table for IP frequencies.
> e.g.,
>
> ip                 freq
> 10.226.93.67       1
> 10.226.93.69       1
> 161.168.251.101    4
> 10.236.70.2        1
> 161.168.251.105    14
>
> All I need is to query the df in a map.
>
> rdd = sc.parallelize(['208.51.22.18', '31.207.6.173', '208.51.22.18'])
>
> freqs = rdd.map(lambda x: df.where(df.ip ==x ).first())
>
> It doesn't get through.. would appreciate any help.
>
> Thanks!
> Ping
>
> --
> Ping Yan
> Ph.D. in Management
> Dept. of Management Information Systems
> University of Arizona
> Tucson, AZ 85721

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau