I did get *some* help from Databricks on programmatically grabbing the categorical columns, but I can't figure out where to go from here:
# Get all string cols/categorical cols
stringColList = [i[0] for i in df.dtypes if i[1] == 'string']

# generate OHEs for every col in stringColList
OHEstages = [OneHotEncoder(inputCol=categoricalCol, outputCol=categoricalCol + "Vector") for categoricalCol in stringColList]

On Fri, Nov 11, 2016 at 2:00 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> For now, OHE supports a single column, so you would have to have 1000 OHEs
> in a pipeline. However, you can add them programmatically, so it is not too
> bad. If the cardinality of each feature is quite low, it should be workable.
>
> After that, use VectorAssembler to stitch the vectors together (it accepts
> multiple input columns).
>
> The other approach is - if your features are all categorical - to encode
> the features as "feature_name=feature_value" strings. This can
> unfortunately only be done with RDD ops, since a UDF can't accept multiple
> columns as input at this time. You can create a new column with all the
> feature name/value pairs as a list of strings ["feature_1=foo",
> "feature_2=bar", ...]. Then use CountVectorizer to create your binary
> vectors. This basically works like the DictVectorizer in scikit-learn.
>
> On Fri, 11 Nov 2016 at 20:33 nsharkey <nicholasshar...@gmail.com> wrote:
>
>> I have a dataset in which I need to convert some of the variables to
>> dummy variables. The get_dummies function in Pandas works perfectly on
>> smaller datasets, but since it collects everything to the driver I'll
>> always be bottlenecked by the master node.
>>
>> I've looked at Spark's OHE feature, and while it would work in theory, I
>> have over a thousand variables to convert, so I don't want to have to
>> build 1000+ OHEs. My project is pretty simple in scope: read in a raw CSV,
>> convert the categorical variables into dummy variables, then save the
>> transformed data back to CSV. That is why I'm so interested in
>> get_dummies, but it's not scalable enough for my data size (500-600 GB
>> per file).
>>
>> Thanks in advance.
>>
>> Nick
>>
>> ------------------------------
>> View this message in context: Finding a Spark Equivalent for Pandas' get_dummies
>> <http://apache-spark-user-list.1001560.n3.nabble.com/Finding-a-Spark-Equivalent-for-Pandas-get-dummies-tp28064.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
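
Based on Nick's suggestion above, here is a rough, untested sketch of how I'm picturing the full pipeline: a StringIndexer per column (since, as I understand it, OneHotEncoder expects numeric category indices rather than raw strings), the generated encoders, and a VectorAssembler to stitch the per-column vectors together. The "Index"/"Vector"/"features" column names are just placeholders I made up:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# all string/categorical columns, as before
stringColList = [i[0] for i in df.dtypes if i[1] == 'string']

# one StringIndexer + OneHotEncoder pair per categorical column
indexers = [StringIndexer(inputCol=c, outputCol=c + "Index") for c in stringColList]
encoders = [OneHotEncoder(inputCol=c + "Index", outputCol=c + "Vector") for c in stringColList]

# stitch every per-column vector into a single feature vector
assembler = VectorAssembler(inputCols=[c + "Vector" for c in stringColList], outputCol="features")

pipeline = Pipeline(stages=indexers + encoders + [assembler])
transformed = pipeline.fit(df).transform(df)

One thing I still haven't worked out is the last step: as far as I know the CSV writer won't take a vector column directly, so I may have to expand the vectors back into plain numeric columns before saving.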
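
And in case I end up going with the second approach instead, my (also untested) reading of the "feature_name=feature_value" plus CountVectorizer idea looks roughly like this; the "tokens"/"features" column names and the binary=True flag (Spark 2.0+) are my assumptions:

from pyspark.ml.feature import CountVectorizer
from pyspark.sql import Row

# build one "col=value" token per categorical column, per row, with RDD ops
def to_tokens(row):
    d = row.asDict()
    return Row(tokens=[c + "=" + str(d[c]) for c in stringColList])

tokenized = df.rdd.map(to_tokens).toDF()

# binary vectors over the name=value vocabulary, like scikit-learn's DictVectorizer
cv = CountVectorizer(inputCol="tokens", outputCol="features", binary=True)
binaryVectors = cv.fit(tokenized).transform(tokenized)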