Hi Peter,
As for setting the parallelism, I would recommend setting it as early as
possible. Ideally, that would mean specifying the number of partitions
when loading the initial data (rather than repartitioning later on).
In general, working with Vector columns should be better since the
Hi, I have the following problem. I have a dataset with 30 columns (15
numeric, 15 categorical) and I am using ml transformers/estimators to
transform each column (StringIndexer for categorical, MeanImputor for
numeric). This creates 30 more columns in the dataframe. After that I'm
using VectorAssembler to