Re: How to transform variables?

Pat Ferrel Fri, 20 Jan 2017 15:37:58 -0800

I see PIO as a production big-data pipeline. It sounds like what you need is a
math framework that is pretty much interactive, where you can change the
function and do some cross-validation in nearly real time. This seems to imply
R, Python, or Scala + Mahout Samsara + Zeppelin. Of these, Mahout is the only
interactive tool that runs on a Spark cluster backend and so can crunch a lot
of data in the interactive Scala shell. If you don’t need big data, the others
might be more familiar. There are a lot of regression algorithms prepackaged in
those and some in PIO templates.


Then, once you have the algorithm designed, put its parameters in engine.json so
you won’t have to change code to tune it, and move it into PIO for everyday
production learning/prediction.
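
A minimal sketch of that idea, in Python for brevity (a real PIO engine would be written in Scala): the feature transform is chosen by name from the algorithm params, so tuning means editing engine.json rather than code. The `featureTransform` parameter and the `TRANSFORMS` table are hypothetical names invented for illustration and are not part of any PIO template.

```python
import json
import math

# Hypothetical engine.json fragment; "featureTransform" is an invented parameter.
ENGINE_JSON = """
{
  "algorithms": [
    {"name": "regression", "params": {"featureTransform": "log"}}
  ]
}
"""

# Named transforms that turn raw (math, read) scores into model features.
TRANSFORMS = {
    "log": lambda m, r: [math.log(m), math.log(r)],
    "squared_diff": lambda m, r: [(m - r) ** 2],
}


def load_transform(engine_json_text):
    """Return the transform function named in the first algorithm's params."""
    params = json.loads(engine_json_text)["algorithms"][0]["params"]
    return TRANSFORMS[params["featureTransform"]]


transform = load_transform(ENGINE_JSON)
features = transform(52.0, 48.0)  # two features: log(52) and log(48)
```

Switching the parameter value to "squared_diff" would select the other transform without touching the code.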


On Jan 20, 2017, at 10:17 AM, Daniel Gabrieli <[email protected]> wrote:

Thank you. That is helpful.  More specifically, I am trying to implement a 
regression of a form like this:

write_score = B0 + B1*gender + B2*log(math) + B3*log(read)

Where a student's predicted writing score is a function of gender, the log of a
math score, and the log of a reading score.

But in fact, what I am trying to understand is how to do feature engineering 
inside of PIO.  I want to try various manipulations of the data to figure out 
what the best features are for a given model (log is a common example).  I 
might want to try, for example, another regression like:

write_score = B0 + B2*(math - read)^2

Where the score on writing is a function of the squared difference between the 
math and reading scores.
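
For interactive experimentation outside PIO, both candidate models can be fit with the same routine by swapping only the feature function. The sketch below is a rough illustration in plain Python: it uses ordinary least squares via the normal equations, and the data is synthetic, generated from made-up coefficients, so the fit should recover them approximately. The function names (`log_features`, `sqdiff_features`, `make_data`) are hypothetical helpers, not PIO APIs.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit(rows, ys):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


def log_features(gender, m, r):
    """Row for: intercept, gender, log(math), log(read)."""
    return [1.0, gender, math.log(m), math.log(r)]


def sqdiff_features(gender, m, r):
    """Row for: intercept, (math - read)^2. Gender is unused in this model."""
    return [1.0, (m - r) ** 2]


def make_data(feature_fn, coeffs, n=50, seed=0):
    """Generate noise-free synthetic samples from made-up coefficients."""
    rng = random.Random(seed)
    rows, ys = [], []
    for _ in range(n):
        row = feature_fn(rng.choice([0.0, 1.0]),
                         rng.uniform(30, 80), rng.uniform(30, 80))
        rows.append(row)
        ys.append(sum(b * x for b, x in zip(coeffs, row)))
    return rows, ys


log_fit = fit(*make_data(log_features, [10.0, 2.0, 5.0, 7.0]))
sqdiff_fit = fit(*make_data(sqdiff_features, [50.0, 0.01]))
```

With real data, the two fits would be compared on held-out error rather than on recovered coefficients.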

I'd prefer to manipulate variables within the PIO Engine because the servers that
send the event data to PIO are "just dumb pipes" and I'd like to keep the "data
science" logic outside of those pipes and inside of PIO as much as possible.




On Fri, Jan 20, 2017 at 12:45 PM Pat Ferrel <[email protected]> wrote:
It would help to know what you are trying to implement.

The datasource and preparator are used only during the input part of train;
they pass data to the train method of your algorithm when you run `pio train`.
The predict method does not use them at all. It may get data from the 
EventStore, but not through those other classes. 
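
One consequence, sketched below in Python rather than Scala: if the transform is kept inside the engine, the predict path has to apply it itself, because nothing from the training-time preparation runs at query time. Keeping the transform in one shared function makes it harder for the two paths to drift apart. The class and function names here are hypothetical and are not PIO classes; the data is synthetic.

```python
import math


def transform(x):
    """Single source of truth for the feature transform (here: natural log)."""
    return math.log(x)


class LogRegressionSketch:
    """Toy one-variable model y = b0 + b1 * log(x); not a PIO class."""

    def train(self, xs, ys):
        # Transform the raw training inputs, then fit simple linear regression.
        ts = [transform(x) for x in xs]
        n = len(ts)
        mean_t = sum(ts) / n
        mean_y = sum(ys) / n
        sxx = sum((t - mean_t) ** 2 for t in ts)
        sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
        self.b1 = sxy / sxx
        self.b0 = mean_y - self.b1 * mean_t
        return self

    def predict(self, x):
        # The query carries the raw value, so the same transform is applied here.
        return self.b0 + self.b1 * transform(x)


# Synthetic data that follows y = 3 + 2 * log(x) exactly.
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [3.0 + 2.0 * math.log(x) for x in xs]
model = LogRegressionSketch().train(xs, ys)
prediction = model.predict(32.0)  # close to 3 + 2 * log(32)
```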

If you need data to always be the log of some number, you may want to take the
log before it is sent to the EventServer so it will always be a log, even when
it arrives in the Query or is read out of the EventServer.


On Jan 20, 2017, at 5:13 AM, Daniel Gabrieli <[email protected]> wrote:

Hi,

I am new to PIO.

I have a variable called X that I would like to take the log of during training
and then during prediction as well.  Where is the appropriate place to put the
log function?

My guess is to override the "prepare" method; while I think the prepare method 
is called just before training, I am not clear whether it is also called before 
prediction.

Do I call the log transformation again somewhere else so that it occurs during 
prediction?  Possibly in the predict method?

Thank you,


 