Re: [Ml][Dataframe] Ml pipeline dataframe repartitioning

2015-04-26 Thread Joseph Bradley
Hi Peter, as for setting the parallelism, I would recommend setting it as early as possible. Ideally, that would mean specifying the number of partitions when loading the initial data (rather than repartitioning later on). In general, working with Vector columns should be better since the

[Ml][Dataframe] Ml pipeline dataframe repartitioning

2015-04-24 Thread Peter Rudenko
Hi, I have the following problem. I have a dataset with 30 columns (15 numeric, 15 categorical) and am using ml transformers/estimators to transform each column (StringIndexer for categorical, MeanImputor for numeric). This creates 30 more columns in the dataframe. After that, I’m using VectorAssembler to