Hello, I have a very large long-format dataframe (several billion rows) that I'd like to pivot and vectorize (using VectorAssembler), with the aim of reducing dimensionality using something akin to TF-IDF. Once pivoted, the dataframe will have ~130 million columns.
The source, long-format schema looks as follows:

root
 |-- entity_id: long (nullable = false)
 |-- attribute_id: long (nullable = false)
 |-- event_count: integer (nullable = true)

Pivoting as follows fails, exhausting both executor and driver memory:

    from pyspark.sql import functions as F

    # explodes into one column per distinct attribute_id (~130M columns)
    wide_frame = (
        long_frame.groupBy("entity_id")
        .pivot("attribute_id")
        .agg(F.first("event_count"))
    )

I am unsure whether increasing memory limits would help here, as my sense is that pivoting and then using a VectorAssembler isn't the right approach to this problem.

Are there other Spark patterns I should attempt in order to achieve my end goal of a vector of attributes for every entity?

Thanks,
Daniel
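P.S. One direction I've been sketching (untested, and it assumes attribute_id values are 0-based integers below the ~130M attribute count, so they can be used directly as sparse-vector indices; otherwise a mapping step would be needed first) is to skip the pivot entirely and build one SparseVector per entity:

    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql import functions as F

    NUM_ATTRIBUTES = 130_000_000  # assumed upper bound on attribute_id

    def to_sparse_vector(pairs):
        # pairs: list of (attribute_id, event_count) rows for one entity;
        # sparse-vector indices must be in increasing order, so sort first
        pairs = sorted(pairs, key=lambda p: p["attribute_id"])
        indices = [p["attribute_id"] for p in pairs]
        values = [float(p["event_count"] or 0) for p in pairs]
        return Vectors.sparse(NUM_ATTRIBUTES, indices, values)

    to_sparse_vector_udf = F.udf(to_sparse_vector, VectorUDT())

    entity_vectors = (
        long_frame
        .groupBy("entity_id")
        .agg(F.collect_list(F.struct("attribute_id", "event_count")).alias("pairs"))
        .withColumn("features", to_sparse_vector_udf("pairs"))
        .drop("pairs")
    )

The idea is to keep the data long and only materialize one sparse row per entity, avoiding the 130-million-column wide frame, but I'd welcome thoughts on whether this scales or whether there is a more idiomatic pattern.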