Re: weird issue with sqlContext.createDataFrame - pyspark 1.3.1

2015-10-09 Thread ping yan
`` > >>> import pandas > >>> pandas.__version__ > '0.14.0' > ``` > > On Thu, Oct 8, 2015 at 10:28 PM, ping yan wrote: > > I really cannot figure out what this is about.. > > (tried to import pandas, in case that is a dependency, but it didn't

weird issue with sqlContext.createDataFrame - pyspark 1.3.1

2015-10-08 Thread ping yan
I really cannot figure out what this is about.. (tried to import pandas, in case that is a dependency, but it didn't help.) >>> from pyspark.sql import SQLContext >>> sqlContext=SQLContext(sc) >>> sqlContext.createDataFrame(l).collect() Traceback (most recent call last): File "", line 1, in F

Spark FP-Growth algorithm for frequent sequential patterns

2015-06-19 Thread ping yan
Hi, I have a use case where I'd like to mine frequent sequential patterns (consider the clickpath scenario). Transaction A -> B doesn't equal Transaction B -> A. From what I understand about FP-growth in general and the MLlib implementation of it, item order is not preserved. Anyone can provide
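
To make the ordering point concrete, here is a minimal plain-Python sketch (not Spark, and not the MLlib FP-growth API) that counts ordered pairs per session, so that A -> B and B -> A are kept apart. The function name `frequent_ordered_pairs`, the sample click paths, and the support threshold are all assumptions made up for illustration.

```python
from collections import Counter


def frequent_ordered_pairs(sessions, min_support):
    """Return ordered pairs (a, b), with a appearing before b in a session,
    that are supported by at least min_support sessions."""
    counts = Counter()
    for session in sessions:
        # Collect each ordered pair at most once per session.
        seen_pairs = set()
        for i, a in enumerate(session):
            for b in session[i + 1:]:
                seen_pairs.add((a, b))
        counts.update(seen_pairs)
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical click paths, invented for this example.
sessions = [["A", "B", "C"], ["A", "B"], ["B", "A"]]
patterns = frequent_ordered_pairs(sessions, min_support=2)
print(patterns)
```

An itemset miner would treat the third session as one more occurrence of {A, B}; here it only supports B -> A, which stays below the threshold.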

Re: RDD of RDDs

2015-06-10 Thread ping yan
r nodes will not be able to >execute the *filter* on *innerRDD *as the code in the worker does not >have access to "sc" and can not launch a spark job. > > > Hope it helps. You need to consider List[RDD] or some other collection. > > -Kiran > > On Tue, Jun 9,

RDD of RDDs

2015-06-08 Thread ping yan
hoices left seem to be: 1) groupByKey() and then work with the ResultIterable object; 2) groupByKey() and then write each group into a file, and read them back as individual RDDs to process. Anyone got a better idea or had a similar problem before? Thanks! Ping -- Ping Yan Ph.D. in Manag

Re: Query a Dataframe in rdd.map()

2015-05-21 Thread ping yan
the ip > frequency table. Hope that helps :) > > > On Thursday, May 21, 2015, ping yan wrote: > >> I have a dataframe as a reference table for IP frequencies. >> e.g., >> >> ip freq >> 10.226.93.67 1 >> 10

Query a Dataframe in rdd.map()

2015-05-21 Thread ping yan
22.18', '31.207.6.173', '208.51.22.18']) freqs = rdd.map(lambda x: df.where(df.ip == x).first()) It doesn't get through.. would appreciate any help. Thanks! Ping -- Ping Yan Ph.D. in Management Dept. of Management Information Systems University of Arizona Tucson, AZ 85721
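
A DataFrame lives on the driver, so a query against it cannot be launched from inside a function running on the workers. One common workaround, sketched below under the assumption that the reference table is small enough to fit in memory, is to turn it into a plain dict and look values up there. The helper names `build_lookup` and `lookup_freq` and all frequencies except the one for 10.226.93.67 are hypothetical; the Spark usage appears only as comments and is not executed here.

```python
def build_lookup(rows):
    """Turn two-column (ip, freq) rows, e.g. from df.collect(), into a dict."""
    return {ip: freq for ip, freq in rows}


def lookup_freq(ip, table, default=0):
    """Return the frequency for ip, or default if the ip is not in the table."""
    return table.get(ip, default)


# Hypothetical reference rows; only the first pair comes from the thread.
rows = [("10.226.93.67", 1), ("31.207.6.173", 4), ("208.51.22.18", 2)]
table = build_lookup(rows)

ips = ["208.51.22.18", "31.207.6.173", "208.51.22.18", "192.0.2.1"]
freqs = [(ip, lookup_freq(ip, table)) for ip in ips]
print(freqs)

# With Spark, the same function could run inside rdd.map by broadcasting
# the dict from the driver (assuming an existing SparkContext `sc`):
#   bc = sc.broadcast(table)
#   freqs = rdd.map(lambda ip: (ip, lookup_freq(ip, bc.value)))
```

If the reference table is too large to collect, turning the IP list into a DataFrame and joining it with the frequency table is the usual alternative.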