Re: [Spark Core] Vectorizing very high-dimensional data sourced in long format

2020-10-30 Thread Daniel Chalef
Yes, the resulting matrix would be sparse. Thanks for the suggestion. Will explore ways of doing this using an agg and UDF.

On Fri, Oct 30, 2020 at 6:26 AM Patrick McCarthy wrote:
> That's a very large vector. Is it sparse? Perhaps you'd have better luck
> performing an aggregate instead of a pivot, and assembling the vector using a UDF.
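
For the agg-and-UDF route discussed here, a minimal PySpark sketch, assuming hypothetical column names row_id/col_id/value and a feature dimensionality known up front (neither is stated in the thread):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import SparseVector, VectorUDT

    spark = SparkSession.builder.getOrCreate()

    DIM = 1_000_000  # assumed: total number of features, known in advance

    # Hypothetical long-format input: one (row_id, col_id, value) record per non-zero entry.
    long_df = spark.createDataFrame(
        [(0, 3, 1.0), (0, 17, 2.5), (1, 3, 4.0)],
        ["row_id", "col_id", "value"],
    )

    @udf(returnType=VectorUDT())
    def to_sparse(entries):
        # `entries` arrives as a list of Row(col_id, value); SparseVector needs sorted indices.
        pairs = sorted((e["col_id"], e["value"]) for e in entries)
        return SparseVector(DIM, [c for c, _ in pairs], [v for _, v in pairs])

    vectors = (
        long_df.groupBy("row_id")
        .agg(F.collect_list(F.struct("col_id", "value")).alias("entries"))
        .select("row_id", to_sparse("entries").alias("features"))
    )
    vectors.show(truncate=False)

The groupBy/collect_list keeps one record per row regardless of dimensionality, and only the non-zero entries are ever materialized.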

Re: Debugging tools for Spark Structured Streaming

2020-10-30 Thread Artemis User
Spark distributes load to executors, and each executor is usually pre-configured with a number of cores. You may want to check with your Spark admin on how many executors (or slaves) your Spark cluster is configured with and how many cores are pre-configured for the executors. The debugging tools…
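
As a hedged illustration of what to check, executor count and core settings are normally fixed at submit time and can be inspected from a running session (the spark-submit flags below are the stock Spark ones; actual values depend on how the cluster defaults were set up):

    from pyspark.sql import SparkSession

    # Executor sizing is typically set at submit time, e.g.
    #   spark-submit --num-executors 10 --executor-cores 4 --executor-memory 8g app.py
    # (--num-executors applies on YARN; other cluster managers rely on
    #  spark.executor.instances, spark.cores.max, or dynamic allocation.)
    spark = SparkSession.builder.getOrCreate()

    for key in ("spark.executor.instances", "spark.executor.cores", "spark.executor.memory"):
        print(key, "=", spark.conf.get(key, "not set (cluster default)"))

    # Rough indicator of how many tasks can run concurrently for this application.
    print("defaultParallelism =", spark.sparkContext.defaultParallelism)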

Re: [Spark Core] Vectorizing very high-dimensional data sourced in long format

2020-10-30 Thread Patrick McCarthy
That's a very large vector. Is it sparse? Perhaps you'd have better luck performing an aggregate instead of a pivot, and assembling the vector using a UDF.

On Thu, Oct 29, 2020 at 10:19 PM Daniel Chalef wrote:
> Hello,
>
> I have a very large long-format dataframe (several billion rows) that I'd…
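
To make the pivot-versus-aggregate contrast concrete, a small sketch with hypothetical column names row_id/col_id/value: a pivot materializes one physical column per distinct feature, while the suggested aggregate collapses each row's entries into a single array column that a UDF can later turn into a sparse vector.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    long_df = spark.createDataFrame(
        [(0, 3, 1.0), (0, 17, 2.5), (1, 3, 4.0)],
        ["row_id", "col_id", "value"],
    )

    # Pivot: one DataFrame column per distinct col_id -- impractical when the
    # feature space runs into the millions.
    wide = long_df.groupBy("row_id").pivot("col_id").agg(F.first("value"))

    # Aggregate: one array-of-struct column per row, independent of dimensionality.
    collected = long_df.groupBy("row_id").agg(
        F.collect_list(F.struct("col_id", "value")).alias("entries")
    )
    collected.show(truncate=False)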