I did get *some* help from Databricks on programmatically grabbing the categorical columns, but I can't figure out where to go from here:
# Get all string cols/categorical cols
stringColList = [i[0] for i in df.dtypes if i[1] == 'string']

# generate OHEs for every col in stringColList
OHEstages = [OneHotEncoder(inputCol=categoricalCol, outputCol=categoricalCol + "Vector") for categoricalCol in stringColList]

On Fri, Nov 11, 2016 at 2:00 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> For now, OHE supports a single column, so you would have to have 1000 OHEs
> in a pipeline. However, you can add them programmatically, so it is not too
> bad. If the cardinality of each feature is quite low, it should be workable.
>
> After that, use VectorAssembler to stitch the vectors together (it accepts
> multiple input columns).
>
> The other approach is - if your features are all categorical - to encode
> the features as "feature_name=feature_value" strings. This can
> unfortunately only be done with RDD ops, since a UDF can't accept multiple
> columns as input at this time. You can create a new column with all the
> feature name/value pairs as a list of strings ["feature_1=foo",
> "feature_2=bar", ...]. Then use CountVectorizer to create your binary
> vectors. This basically works like the DictVectorizer in scikit-learn.
>
> On Fri, 11 Nov 2016 at 20:33 nsharkey <nicholasshar...@gmail.com> wrote:
>
>> I have a dataset in which I need to convert some of the variables to
>> dummy variables. The get_dummies function in Pandas works perfectly on
>> smaller datasets, but since it collects everything to the driver I'll
>> always be bottlenecked by the master node.
>>
>> I've looked at Spark's OHE feature, and while it would work in theory, I
>> have over a thousand variables to convert, so I don't want to have to
>> build 1000+ OHEs. My project is pretty simple in scope: read in a raw CSV,
>> convert the categorical variables into dummy variables, then save the
>> transformed data back to CSV. That is why I'm so interested in
>> get_dummies, but it's not scalable enough for my data size (500-600 GB
>> per file).
>>
>> Thanks in advance.
>>
>> Nick
>>
>> ------------------------------
>> View this message in context: Finding a Spark Equivalent for Pandas' get_dummies
>> <http://apache-spark-user-list.1001560.n3.nabble.com/Finding-a-Spark-Equivalent-for-Pandas-get-dummies-tp28064.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
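
Based on Nick's suggestion above, here is a rough, untested sketch of how I'm picturing the full pipeline: a StringIndexer per column (since, as I understand it, OneHotEncoder expects numeric category indices rather than raw strings), the generated encoders, and a VectorAssembler to stitch the per-column vectors together. The "Index"/"Vector"/"features" column names are just placeholders I made up:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# all string/categorical columns, as before
stringColList = [i[0] for i in df.dtypes if i[1] == 'string']

# one StringIndexer + OneHotEncoder pair per categorical column
indexers = [StringIndexer(inputCol=c, outputCol=c + "Index") for c in stringColList]
encoders = [OneHotEncoder(inputCol=c + "Index", outputCol=c + "Vector") for c in stringColList]

# stitch every per-column vector into a single feature vector
assembler = VectorAssembler(inputCols=[c + "Vector" for c in stringColList], outputCol="features")

pipeline = Pipeline(stages=indexers + encoders + [assembler])
transformed = pipeline.fit(df).transform(df)

One thing I still haven't worked out is the last step: as far as I know the CSV writer won't take a vector column directly, so I may have to expand the vectors back into plain numeric columns before saving.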
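
And in case I end up going with the second approach instead, my (also untested) reading of the "feature_name=feature_value" plus CountVectorizer idea looks roughly like this; the "tokens"/"features" column names and the binary=True flag (Spark 2.0+) are my assumptions:

from pyspark.ml.feature import CountVectorizer
from pyspark.sql import Row

# build one "col=value" token per categorical column, per row, with RDD ops
def to_tokens(row):
    d = row.asDict()
    return Row(tokens=[c + "=" + str(d[c]) for c in stringColList])

tokenized = df.rdd.map(to_tokens).toDF()

# binary vectors over the name=value vocabulary, like scikit-learn's DictVectorizer
cv = CountVectorizer(inputCol="tokens", outputCol="features", binary=True)
binaryVectors = cv.fit(tokenized).transform(tokenized)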