Hi Adrian,

Did you try SQLTransformer? Your preprocessing steps are SQL operations, so SQLTransformer can handle them as stages within an MLlib pipeline.
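As a rough sketch of how step 2 (group by user_id, pivot and count event_type) might be written as a SQLTransformer statement: the helper below only builds the SQL string, using the `__THIS__` placeholder that SQLTransformer substitutes with the input DataFrame. The helper name `pivot_count_statement` and the event types "click" and "view" are made up for illustration, and the sketch assumes the pivot values are known ahead of time and are simple, trusted names (nothing is escaped).

```python
def pivot_count_statement(group_col, pivot_col, values):
    """Build a SELECT ... GROUP BY statement that counts rows per pivot value.

    Hypothetical helper: `values` must be listed up front, and column names
    and values are assumed to be plain identifiers (no quoting or escaping).
    """
    cases = ",\n  ".join(
        "SUM(CASE WHEN {p} = '{v}' THEN 1 ELSE 0 END) AS {v}".format(p=pivot_col, v=v)
        for v in values
    )
    return "SELECT {g},\n  {c}\nFROM __THIS__\nGROUP BY {g}".format(
        g=group_col, c=cases
    )


# "click" and "view" are placeholder event types, not taken from the thread.
statement = pivot_count_statement("user_id", "event_type", ["click", "view"])
print(statement)

# With PySpark available, the statement should plug into a pipeline stage:
#   from pyspark.ml.feature import SQLTransformer
#   counts = SQLTransformer(statement=statement)
#   features = counts.transform(df)  # df needs user_id and event_type columns
```

One difference from df.groupby("user_id").pivot("event_type").count(): the SUM(CASE ...) form yields 0 for a user with no events of a given type, where pivot().count() yields null. Fixing the list of event types up front also keeps the output columns the same between training and production data.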
Thanks
Yanbo

On Thu, Mar 9, 2017 at 11:02 AM, aATv <adr...@vidora.com> wrote:

> I want to start using PySpark MLlib pipelines, but I don't understand
> how/where preprocessing fits into the pipeline.
>
> My preprocessing steps are generally in the following form:
> 1) Load log files (from S3) and parse them into a Spark DataFrame with
>    columns user_id, event_type, timestamp, etc.
> 2) Group by a column, then pivot and count another column
>    - e.g. df.groupby("user_id").pivot("event_type").count()
>    - We can think of the columns that this creates besides user_id as
>      features, where the number of each event type is a different feature
> 3) Join the data from step 1 with other metadata, usually stored in
>    Cassandra. Then perform a transformation similar to the one from step 2),
>    where the column that is pivoted and counted is a column that came from
>    the data stored in Cassandra.
>
> After this preprocessing, I would use transformers to create other features
> and feed them into a model, let's say Logistic Regression for example.
>
> I would like to make at least step 2 a custom transformer and add that to a
> pipeline, but it doesn't fit the transformer abstraction. This is because it
> takes a single input column and outputs multiple columns. It also has a
> different number of input rows than output rows due to the group by
> operation.
>
> Given that, how do I fit this into an MLlib pipeline, and if it doesn't fit
> as part of a pipeline, what is the best way to include it in my code so that
> it can easily be reused both for training and testing, as well as in
> production?
>
> I'm using pyspark 2.1 and here is an example of 2)
>
> Note: My question is in some way related to this question, but I don't think
> it is answered here:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Why-can-t-a-Transformer-have-multiple-output-columns-td18689.html
>
> Thanks
> Adrian