Re: How does preprocessing fit into Spark MLlib pipeline

Yanbo Liang Fri, 17 Mar 2017 07:17:29 -0700

Hi Adrian,

Have you tried SQLTransformer? Your preprocessing steps are SQL operations, so
they can be handled by SQLTransformer as a stage within an MLlib pipeline.
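
As a rough sketch of what that could look like for step 2 (not a tested
recipe): SQLTransformer takes a SQL statement in which __THIS__ stands for the
input DataFrame. Since Spark 2.1 SQL has no PIVOT clause, the sketch below
emulates groupby/pivot/count with one conditional SUM per event type. The
helper names, the "_count" column suffix, and the event types are assumptions
made for illustration, and the list of event types has to be known up front.
Unlike pivot().count(), this yields 0 rather than null for event types a user
never produced.

```python
def pivot_count_statement(key, pivot_col, values):
    """Build a SQLTransformer statement that groups by `key` and emits one
    count column per entry in `values` (the expected values of `pivot_col`).

    Assumption: `values` are simple identifier-like strings (no quotes or
    spaces), since they are used both as SQL literals and in column names.
    """
    cols = ",\n  ".join(
        "SUM(CASE WHEN {p} = '{v}' THEN 1 ELSE 0 END) AS {v}_count".format(
            p=pivot_col, v=v
        )
        for v in values
    )
    # __THIS__ is the placeholder SQLTransformer replaces with the input table.
    return "SELECT {k},\n  {c}\nFROM __THIS__\nGROUP BY {k}".format(k=key, c=cols)


def make_event_count_stage(event_types):
    """Wrap the statement in a SQLTransformer usable as a Pipeline stage."""
    # Imported here so the statement builder above can be used without Spark.
    from pyspark.ml.feature import SQLTransformer

    return SQLTransformer(
        statement=pivot_count_statement("user_id", "event_type", event_types)
    )
```

With a DataFrame df holding user_id and event_type columns, something like
make_event_count_stage(["click", "view"]).transform(df) would return one row
per user_id with click_count and view_count columns, and the same stage object
can be listed in Pipeline(stages=[...]) ahead of the feature transformers.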


Thanks
Yanbo

On Thu, Mar 9, 2017 at 11:02 AM, aATv <[email protected]> wrote:

> I want to start using PySpark MLlib pipelines, but I don't understand
> how/where preprocessing fits into the pipeline.
>
> My preprocessing steps are generally in the following form:
>    1) Load log files (from S3) and parse them into a Spark DataFrame with
> columns user_id, event_type, timestamp, etc.
>    2) Group by a column, then pivot and count another column
>       - e.g. df.groupby("user_id").pivot("event_type").count()
>       - We can think of the columns that this creates besides user_id as
> features, where the number of each event type is a different feature
>    3) Join the data from step 1 with other metadata, usually stored in
> Cassandra. Then perform a transformation similar to one from step 2), where
> the column that is pivoted and counted is a column that came from the data
> stored in Cassandra.
>
> After this preprocessing, I would use transformers to create other features
> and feed them into a model, let's say Logistic Regression for example.
>
> I would like to make at least step 2 a custom transformer and add that to a
> pipeline, but it doesn't fit the transformer abstraction. This is because it
> takes a single input column and outputs multiple columns. It also has a
> different number of input rows than output rows due to the group by
> operation.
>
> Given that, how do I fit this into an MLlib pipeline, and if it doesn't fit
> as part of a pipeline, what is the best way to include it in my code so that
> it can easily be reused for training and testing, as well as in
> production?
>
> I'm using pyspark 2.1 and here is an example of 2)
>
>
>
>
> Note: My question is in some way related to this question, but I don't think
> it is answered here:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Why-can-t-a-Transformer-have-multiple-output-columns-td18689.html
>
> Thanks
> Adrian
