I'm working on large-scale logistic regression for CTR prediction, and when I
use Interaction for feature engineering, the driver OOMs. In detail, I cross
userid (one-hot, ~300k dimensions, sparse) with the base features (60
dimensions, dense), and driver memory is set to 40g.
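
Roughly, the feature pipeline looks like the sketch below (userIdIndex,
baseFeatureCols, df and the other column names are just placeholders, and the
exact encoder API depends on the Spark version):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{Interaction, OneHotEncoder, VectorAssembler}

    // userid -> one-hot sparse vector (~300k dims); 60 base features -> one dense vector
    val userEncoder = new OneHotEncoder()
      .setInputCol("userIdIndex")        // assumes userid is already an integer index
      .setOutputCol("userIdVec")

    val baseAssembler = new VectorAssembler()
      .setInputCols(baseFeatureCols)     // the 60 dense base feature columns
      .setOutputCol("baseVec")

    val interaction = new Interaction()
      .setInputCols(Array("userIdVec", "baseVec"))
      .setOutputCol("crossVec")          // ~300k * 60 = ~18M dims, sparse

    val crossed = new Pipeline()
      .setStages(Array(userEncoder, baseAssembler, interaction))
      .fit(df)
      .transform(df)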

So I attached a remote debugger, and I found that Spark's Interaction creates a
very big schema and that a lot of the work is done on the driver.

There are two questions:

1. By reading the source, I found that Interaction is implemented with sparse
vectors, so it should not need that much memory. Why does it need to do this
work on the driver?

2. The interaction result is a sparse DataFrame with ~18 million dimensions.
Why is a schema with ~18 million StructFields so big? Here is a heap dump taken
when the schema starts to be created (it is too big, so I can't dump all of
it): https://i.stack.imgur.com/h0XBf.jpg
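
To show what I mean by the schema being big, the per-dimension attribute
metadata attached to the output column can be checked roughly like this
(crossVec/crossed are the placeholder names from the sketch above, and this is
only how I read it, not a definitive explanation):

    import org.apache.spark.ml.attribute.AttributeGroup

    // The output vector column carries per-dimension attribute metadata;
    // with ~18M dimensions this metadata alone is already huge on the driver.
    val field = crossed.schema("crossVec")
    val attrs = AttributeGroup.fromStructField(field)
    println(s"output dimensions: ${attrs.size}")
    println(s"metadata json size: ${field.metadata.json.length}")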

So I implemented the interaction myself with RDDs, and the job finishes in 5
minutes, so I am wondering whether I'm doing something wrong here.
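
For reference, my RDD version is roughly the following (simplified; inputRdd
and numUsers are placeholders, and it assumes userid is already an integer
index):

    import org.apache.spark.ml.linalg.Vectors

    val numBase = 60
    val numUsers = 300000            // ~300k distinct user ids (assumed)

    // Crossing a one-hot userid with 60 dense features: only the 60 entries
    // in the block belonging to that userid are non-zero.
    val crossedRdd = inputRdd.map { case (userIdx: Int, base: Array[Double]) =>
      val indices = Array.tabulate(numBase)(j => userIdx * numBase + j)
      Vectors.sparse(numUsers * numBase, indices, base)
    }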

-- 
fun coding
