OK. So several of these show pretty massive skew. That will be a problem. The key to look for is the difference between mean and median.
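To make the mean-versus-median check concrete, here is a minimal sketch in R. The helper name skew_report and the toy data are my own assumptions, not something from the thread; it expresses the gap between mean and median in units of the interquartile range so that variables on very different scales can be compared.

```r
# Hypothetical helper (not from the thread): flag skew by comparing
# the mean and the median of each column of a data frame.
skew_report <- function(x) {
  stats <- sapply(x, function(v) {
    c(mean = mean(v), median = median(v), IQR = IQR(v))
  })
  out <- as.data.frame(t(stats))
  # Gap between mean and median, in units of the interquartile range.
  # Note: this is Inf/NaN for a column whose IQR is zero.
  out$gap_in_iqr <- (out$mean - out$median) / out$IQR
  out
}

# Toy data standing in for the real predictors.
set.seed(1)
x <- data.frame(skewed    = rlnorm(1000, meanlog = 3, sdlog = 2),
                symmetric = rnorm(1000, mean = 50, sd = 10))
report <- skew_report(x)
print(report)
```

Applying the same arithmetic by hand to the quoted summary gives a gap of roughly 1.5 IQRs for v1 and 1.8 for v2, against about -0.07 for v6, which is why v1 and v2 stand out.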
You also have massively different scales for different variables. One variable (v6) runs from 1 to 5, another (v3) reaches 5.8 x 10^10. Interestingly, v1 is not only skewed but also has negative values, so a plain log transform will not work on it as-is.

One handy transform that I have used in the past is to use a smoothed version of the empirical cumulative distribution function to transform the variables. In R, the ecdf function returns such a function (unsmoothed). Applying it maps each of your variables to a uniform distribution on [0,1]. You can further transform this to a normal distribution if you care to.

If that is too crazy, then try this:

    xv1 = log(pmax(1e-6, v1 + 55))
    xv2 = log(pmax(1e-6, v2))
    xv3 = log(pmax(1e-6, v3))
    xv4 = v4
    xv5 = v5
    xv6 = v6

(pmax is the element-wise maximum; plain max would collapse each vector to a single number.) The +55 shifts v1 so that its minimum is 0, and the 1e-6 floor keeps the log finite where the value is zero.

Then normalize all of the xv variables to have unit variance and zero mean.

On Fri, Sep 7, 2012 at 8:23 AM, Mike Burba <[email protected]> wrote:

> Took some massaging to get into R. Here is the output as requested
> for the 6 predictor variables:
>
> > summary(x)
>
>        v1                 v2
>  Min.   :   -55.0   Min.   :    0.0
>  1st Qu.:     6.0   1st Qu.:    0.0
>  Median :    62.0   Median :    2.0
>  Mean   :   658.7   Mean   :   25.4
>  3rd Qu.:   391.0   3rd Qu.:   13.0
>  Max.   :461311.0   Max.   :21532.0
>
>        v3                  v4
>  Min.   :0.000e+00   Min.   :  3.00
>  1st Qu.:1.821e+06   1st Qu.: 36.00
>  Median :1.268e+07   Median : 47.00
>  Mean   :2.345e+07   Mean   : 50.35
>  3rd Qu.:3.364e+07   3rd Qu.: 62.00
>  Max.   :5.820e+10   Max.   :257.00
>
>        v5                v6
>  Min.   :    0.0   Min.   :1.000
>  1st Qu.:  356.0   1st Qu.:2.000
>  Median :  623.0   Median :3.000
>  Mean   :  956.7   Mean   :2.862
>  3rd Qu.: 1100.0   3rd Qu.:4.000
>  Max.   :33413.0   Max.   :5.000
>
> So now I am going through the process of transforming / scaling. Any
> top-of-mind thoughts on the output above are welcome...to help me
> validate my thought process.
>
> Thanks for the hints, I will let you know how it turns out.
>
> Mike
>
> On Thu, Sep 6, 2012 at 8:14 PM, Ted Dunning <[email protected]> wrote:
>
> > Try transforming them as well, likely with a log if they are positive and
> > have heavily skewed values.
> >
> > Can you suck the data into R and paste in the results of summary(x)?
> > (assuming you put the data into the variable x). This should look
> > something like:
> >
> > > summary(x)
> > >        v1                 v2                  v3
> > >  Min.   :-3.41939   Min.   :0.0002538   Min.   :1.188
> > >  1st Qu.:-0.66695   1st Qu.:0.3122501   1st Qu.:3.321
> > >  Median :-0.07277   Median :0.6830144   Median :3.972
> > >  Mean   :-0.05619   Mean   :1.0286261   Mean   :4.010
> > >  3rd Qu.: 0.56784   3rd Qu.:1.4619058   3rd Qu.:4.712
> > >  Max.   : 2.74271   Max.   :7.7754864   Max.   :7.252
> >
> > On Thu, Sep 6, 2012 at 4:58 PM, Diederik van Liere <[email protected]> wrote:
> >
> > > > - My (6) predictor variables are all numeric; some of the variables range
> > > > from 0...5, others range from 0...1,000,000.
> > > Have you tried rescaling your predictor variables so they have the same
> > > range?
> > >
> > > Diederik
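
For reference, the two options suggested at the top of this message (shifted log plus standardizing, and the ECDF transform) can be sketched in R roughly as follows. This is only an illustration under assumptions: the toy data frame stands in for the real predictors, the helper names log_floor and to_uniform are made up for this example, and the ECDF here is the plain unsmoothed one, rescaled by n/(n+1) so that qnorm never sees exactly 1.

```r
# Toy stand-ins for three of the predictors; the real data is not in the thread.
set.seed(42)
n <- 500
x <- data.frame(
  v1 = rlnorm(n, meanlog = 4, sdlog = 2) - 55,  # skewed, can be negative
  v2 = rpois(n, 3) * rbinom(n, 1, 0.6),         # skewed, many zeros
  v4 = rnorm(n, mean = 50, sd = 15)             # roughly symmetric
)

# Option 1: shifted, floored log, then standardize.
# pmax is element-wise; max would return a single number for the whole vector.
log_floor <- function(v, shift = 0, floor = 1e-6) log(pmax(floor, v + shift))

xv <- data.frame(
  xv1 = log_floor(x$v1, shift = 55),
  xv2 = log_floor(x$v2),
  xv4 = x$v4
)
# scale() centers each column to mean 0 and rescales to unit variance.
xv_scaled <- as.data.frame(scale(xv))

# Option 2: map through the empirical CDF to (0, 1), then to a normal.
to_uniform <- function(v) {
  nv <- length(v)
  # ecdf(v)(v) lies in {1/n, ..., 1}; rescale to stay strictly below 1.
  ecdf(v)(v) * nv / (nv + 1)
}
u1 <- to_uniform(x$v1)  # approximately uniform on (0, 1)
z1 <- qnorm(u1)         # approximately standard normal

print(summary(xv_scaled))
print(summary(z1))
```

The ECDF route needs no per-variable shift or floor, but the mapping is learned from the training data, so the same fitted function would have to be reused on any new data.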
