I have a dataset in which I need to convert some of the variables to dummy
variables. Pandas' get_dummies function works perfectly on smaller datasets,
but since it requires collecting all the data onto a single machine, I'll
always be bottlenecked by the master node.
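For reference, this is the kind of one-liner I get on smaller data (toy frame and column names made up for illustration):

```python
import pandas as pd

# Small stand-in for the real dataset
df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size": ["S", "M", "S"]})

# One call turns every listed categorical column into 0/1 dummy columns
dummies = pd.get_dummies(df, columns=["color", "size"], dtype=int)
print(sorted(dummies.columns))
```

This gives columns like color_blue, color_red, size_M, size_S in a single call, which is exactly the ergonomics I'm after.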

I've looked at Spark's OneHotEncoder, and while that would work in theory, I
have over a thousand variables to convert, so I don't want to have to set up
1,000+ separate encoder stages. My project is pretty simple in scope: read in
a raw CSV, convert the categorical variables into dummy variables, then save
the transformed data back to CSV. That's why I'm so interested in
get_dummies, but it isn't scalable enough for my data size (500-600 GB per
file).

Thanks in advance.

Nick

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-a-Spark-Equivalent-for-Pandas-get-dummies-tp28064.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.