Re: [Scikit-learn-general] LabelEncoder with never seen before values
Another take on my previous question: is fitting a LabelEncoder on the *entire* dataset (instead of only on the training set) a comparable sin (i.e. a common ML mistake) to, say, doing so with a Scaler or some other preprocessing technique?

If the answer is yes (which is what I assume, because I guess it can be considered a form of data leakage), what is the standard way to handle test values (for a categorical variable) that have never been encountered in the training set?

On 9 January 2014 15:21, Christian Jauvin cjau...@gmail.com wrote:

Hi,

If a LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set.

The only solution I could come up with for this is to map everything new in the test set (i.e. not belonging to any existing class) to "unknown", and then explicitly add a corresponding class to the LabelEncoder afterward:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    # train and test are pandas DataFrames and c is whatever column
    le = LabelEncoder()
    train[c] = le.fit_transform(train[c])
    test[c] = test[c].map(lambda s: 'unknown' if s not in le.classes_ else s)
    le.classes_ = np.append(le.classes_, 'unknown')
    test[c] = le.transform(test[c])

This works, but is there a better solution?

___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
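
For illustration, the idea in the quoted message (learn the classes from the training set only, and send anything unseen to a single reserved code) can be sketched without touching LabelEncoder internals. This is only a minimal sketch in plain Python: the class name UnknownAwareEncoder and the sample columns are hypothetical and are not part of scikit-learn or of the original post.

```python
# Sketch only: UnknownAwareEncoder is a hypothetical helper, not a
# scikit-learn class.
class UnknownAwareEncoder:
    """Integer-encode labels, sending unseen values to one reserved code."""

    def fit(self, values):
        # Learn the classes from the training values only (sorted, as
        # LabelEncoder.classes_ is) and reserve the next code for unseen values.
        self.classes_ = sorted(set(values))
        self.index_ = {label: code for code, label in enumerate(self.classes_)}
        self.unknown_code_ = len(self.classes_)
        return self

    def transform(self, values):
        # dict.get falls back to the reserved code for anything not seen in fit.
        return [self.index_.get(v, self.unknown_code_) for v in values]


# Made-up example columns.
train_col = ["red", "green", "blue", "green"]
test_col = ["green", "purple", "red"]

encoder = UnknownAwareEncoder().fit(train_col)  # fitted on training data only
train_codes = encoder.transform(train_col)
test_codes = encoder.transform(test_col)        # "purple" gets the reserved code
```

Because the encoder never looks at the test values while fitting, nothing leaks from the test set. Note that the reserved code never appears in the training data, so how a downstream model treats it depends on that model.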
Re: [Scikit-learn-general] Theil-Sen estimator for a multiple linear regression problem
Hi,

at Blue Yonder we often use Scikit-Learn but sometimes miss more robust regression methods that are not based on the L2 norm. So far I only knew Theil-Sen as a linear regression method with a single explanatory variable. The work of Xin Dang, Hanxiang Peng, Xueqin Wang and Heping Zhang extends the method to n explanatory variables, so I think it should fit well into the sklearn.linear_model subpackage.

Where is the line drawn between functionality that should go into StatsModels and functionality that should go into Scikit-Learn with respect to regression methods?

Florian

On 10 January 2014 19:18, Skipper Seabold jsseab...@gmail.com wrote:

Hi,

There have been some implementations of Theil-Sen floating around for inclusion in statsmodels, but no PRs yet. IMO it might fit a little better in statsmodels.robust than in sklearn, unless there are some aspects of Theil-Sen I'm not familiar with.

Skipper

Sent from my mobile

On Jan 10, 2014, at 12:16 PM, florian.wilh...@gmail.com wrote:

Hi,

I'd like to add a Theil-Sen estimator for a multiple linear regression problem to Scikit-Learn, as described in the paper: http://home.olemiss.edu/~xdang/papers/MTSE.pdf

Is anyone already working on this, or are there any objections to including a Theil-Sen estimator in Scikit-Learn?

Best regards,
Florian Wilhelm
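
For readers unfamiliar with the method, the classical single-variable Theil-Sen estimator mentioned above takes the median of the slopes through all pairs of points. The following is a minimal sketch of that univariate case only; it does not implement the multivariate extension from the paper, and the function name theil_sen_1d and the sample data are made up for illustration.

```python
from itertools import combinations
from statistics import median


def theil_sen_1d(x, y):
    """Classical single-variable Theil-Sen fit: returns (slope, intercept)."""
    # Slope: median of the slopes through all pairs of points with distinct x.
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    slope = median(slopes)
    # Intercept: median of the residuals y - slope * x (one common choice).
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


# Made-up data: points on y = 2x + 1, with one gross outlier at x = 4.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 100.0]
slope, intercept = theil_sen_1d(x, y)
```

On this data the outlier affects only four of the ten pairwise slopes, so the median slope stays on the line through the other four points, which is the kind of robustness the thread is asking for.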
Re: [Scikit-learn-general] Theil-Sen estimator for a multiple linear regression problem
Hi,

did you try SVR, possibly setting epsilon to 0? If it's too slow, have a look at lightning's new LinearSVR estimator.

Alex

On Sat, Jan 11, 2014 at 7:28 PM, florian.wilh...@gmail.com wrote:

Hi,

at Blue Yonder we often use Scikit-Learn but sometimes miss more robust regression methods that are not based on the L2 norm. So far I only knew Theil-Sen as a linear regression method with a single explanatory variable. The work of Xin Dang, Hanxiang Peng, Xueqin Wang and Heping Zhang extends the method to n explanatory variables, so I think it should fit well into the sklearn.linear_model subpackage.

Where is the line drawn between functionality that should go into StatsModels and functionality that should go into Scikit-Learn with respect to regression methods?

Florian

On 10 January 2014 19:18, Skipper Seabold jsseab...@gmail.com wrote:

Hi,

There have been some implementations of Theil-Sen floating around for inclusion in statsmodels, but no PRs yet. IMO it might fit a little better in statsmodels.robust than in sklearn, unless there are some aspects of Theil-Sen I'm not familiar with.

Skipper

Sent from my mobile

On Jan 10, 2014, at 12:16 PM, florian.wilh...@gmail.com wrote:

Hi,

I'd like to add a Theil-Sen estimator for a multiple linear regression problem to Scikit-Learn, as described in the paper: http://home.olemiss.edu/~xdang/papers/MTSE.pdf

Is anyone already working on this, or are there any objections to including a Theil-Sen estimator in Scikit-Learn?

Best regards,
Florian Wilhelm
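
As background to the SVR suggestion: SVR penalises residuals with the epsilon-insensitive loss, which is zero inside a tube of half-width epsilon and grows linearly outside it. With epsilon set to 0 it reduces to the absolute error, which is presumably why it is proposed here as a more robust alternative to an L2-based fit. The sketch below only compares the loss functions on made-up residuals; it does not fit a model, and the function names are illustrative.

```python
def epsilon_insensitive_loss(residual, epsilon):
    """Loss used by SVR: zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(residual) - epsilon)


def squared_loss(residual):
    """L2 loss, for comparison: grows quadratically with the residual."""
    return residual ** 2


# Made-up residuals; the last one plays the role of an outlier.
residuals = [0.05, -0.5, 10.0]
for r in residuals:
    print(
        r,
        epsilon_insensitive_loss(r, 0.1),  # small residuals are ignored
        epsilon_insensitive_loss(r, 0.0),  # epsilon = 0: plain absolute error
        squared_loss(r),                   # the outlier dominates under L2
    )
```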