Hello,

I have a very large long-format dataframe (several billion rows) that I'd
like to pivot and vectorize (using VectorAssembler), with the aim of
reducing dimensionality via something akin to TF-IDF. Once pivoted, the
dataframe will have ~130 million columns.

The source long-format schema looks as follows:

root
 |-- entity_id: long (nullable = false)
 |-- attribute_id: long (nullable = false)
 |-- event_count: integer (nullable = true)
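
For reference, a toy frame with this schema (made-up values, purely for
illustration) can be built like so:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, LongType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("entity_id", LongType(), nullable=False),
    StructField("attribute_id", LongType(), nullable=False),
    StructField("event_count", IntegerType(), nullable=True),
])

# Toy rows; the real frame has several billion of these.
long_frame = spark.createDataFrame(
    [(1, 10, 3), (1, 11, 1), (2, 10, 7)],
    schema=schema,
)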

Pivoting as per the following fails, exhausting both executor and driver
memory. I am unsure whether increasing memory limits would help here; as a
back-of-the-envelope estimate, a single dense row with ~130 million integer
columns is already on the order of a gigabyte, so my sense is that pivoting
and then feeding the result to a VectorAssembler isn't the right approach
to this problem.

wide_frame = (
    long_frame.groupBy("entity_id")
    # one output column per distinct attribute_id (~130 million of them)
    .pivot("attribute_id")
    .agg(F.first("event_count"))
)

Are there other Spark patterns that I should attempt in order to achieve my
end goal of a vector of attributes for every entity?
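
For context, one alternative I have sketched (but not validated at any
scale) skips the pivot entirely and builds ml SparseVectors directly from
the grouped long rows. It assumes attribute_id values can serve as 0-based
vector indices and that there is one row per (entity_id, attribute_id)
pair; NUM_ATTRIBUTES and to_sparse below are just my own placeholders:

from pyspark.ml.linalg import Vectors, VectorUDT

NUM_ATTRIBUTES = 130_000_000  # placeholder; really max attribute index + 1

# Collect (attribute_id, event_count) pairs per entity instead of pivoting.
grouped = (
    long_frame.groupBy("entity_id")
    .agg(
        F.sort_array(
            F.collect_list(F.struct("attribute_id", "event_count"))
        ).alias("pairs")
    )
)

# SparseVector wants sorted indices, hence the sort_array above.
@F.udf(returnType=VectorUDT())
def to_sparse(pairs):
    indices = [int(p["attribute_id"]) for p in pairs]
    values = [float(p["event_count"] or 0) for p in pairs]
    return Vectors.sparse(NUM_ATTRIBUTES, indices, values)

vectors = grouped.select("entity_id", to_sparse("pairs").alias("features"))

I am unsure whether a Python UDF is workable at this scale, or whether
there is a more idiomatic route.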

Thanks, Daniel
