Re: [Scikit-learn-general] LabelEncoder with never seen before values

2014-01-11 Thread Christian Jauvin
Another take on my previous question is this other question:

Is fitting a LabelEncoder on the *entire* dataset (instead of only on
the training set) as much of a sin (i.e. a common ML mistake) as,
say, doing so with a Scaler or some other preprocessing technique?

If the answer is yes (which is what I assume, since I guess it can be
considered a form of data leakage), what is the standard way to
handle test values (for a categorical variable) that were never
encountered in the training set?


On 9 January 2014 15:21, Christian Jauvin cjau...@gmail.com wrote:
 Hi,

 If a LabelEncoder has been fitted on a training set, it might break if it
 encounters new values when used on a test set.

 The only solution I could come up with for this is to map everything new in
 the test set (i.e. not belonging to any existing class) to 'unknown', and
 then explicitly add a corresponding class to the LabelEncoder afterward:

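 # train and test are pandas DataFrames and c is the name of a column
 import numpy as np
 from sklearn.preprocessing import LabelEncoder
 le = LabelEncoder()
 train[c] = le.fit_transform(train[c])
 # map values not seen during fit to a placeholder label
 test[c] = test[c].map(lambda s: 'unknown' if s not in le.classes_ else s)
 # register the placeholder as an extra class so that transform accepts it
 le.classes_ = np.append(le.classes_, 'unknown')
 test[c] = le.transform(test[c])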

 This works, but is there a better solution?
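
A minimal sketch of the same idea without touching classes_ after fitting: the encoder is fitted on the training split only, and any label it has not seen is sent to one reserved integer code. SafeLabelEncoder and unknown_code_ are hypothetical names used for illustration, not part of scikit-learn, and the example data is made up.

```python
class SafeLabelEncoder:
    """Hypothetical helper (not part of scikit-learn): integer-encode labels,
    sending anything unseen during fit to one reserved code."""

    def fit(self, values):
        # Sorted unique training labels get the codes 0..n-1.
        self.classes_ = sorted(set(values))
        self._code = {label: i for i, label in enumerate(self.classes_)}
        # Reserve the next integer for labels never seen during fit.
        self.unknown_code_ = len(self.classes_)
        return self

    def transform(self, values):
        # Known labels map to their code, everything else to the reserved code.
        return [self._code.get(v, self.unknown_code_) for v in values]


train = ["paris", "tokyo", "paris", "lima"]
test = ["tokyo", "oslo", "lima"]  # "oslo" never appears in train

enc = SafeLabelEncoder().fit(train)  # fitted on the training split only
train_codes = enc.transform(train)
test_codes = enc.transform(test)
```

Because the reserved code never occurs in the training data, a model trained on train_codes has learned nothing about it; how sensibly it treats that code depends on the model.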


--
CenturyLink Cloud: The Leader in Enterprise Cloud Services.
Learn Why More Businesses Are Choosing CenturyLink Cloud For
Critical Workloads, Development Environments  Everything In Between.
Get a Quote or Start a Free Trial Today. 
http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Theil-Sen estimator for a multiple linear regression problem

2014-01-11 Thread florian.wilh...@gmail.com
Hi,

At Blue Yonder we often use Scikit-Learn, but we sometimes miss
more robust regression methods that are not based on the L2 norm.
So far I only knew Theil-Sen as a linear regression method with only a
single explanatory variable. The work of Xin Dang, Hanxiang Peng,
Xueqin Wang and Heping Zhang extends the method to n explanatory
variables, so I think it should fit perfectly into the
sklearn.linear_model subpackage. Where is the line drawn between
functionality that should go into StatsModels and functionality that
should go into Scikit-Learn with respect to regression methods?
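
For reference, the single-variable Theil-Sen estimator mentioned above can be sketched in a few lines: the slope is the median of the slopes through all pairs of points, and the intercept is taken here as the median of y - slope * x (one common choice). This is only the classical one-variable case, not the multivariate extension from the paper; theil_sen_1d is a hypothetical name and the data is made up.

```python
from itertools import combinations
from statistics import median


def theil_sen_1d(x, y):
    """Return (slope, intercept) of a Theil-Sen fit for one explanatory variable."""
    # Slopes of the lines through every pair of points with distinct x values.
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x values")
    slope = median(slopes)
    # One common choice of intercept: the median of the offsets y - slope * x.
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


# Points on the line y = 2x + 1, with one gross outlier at x = 4.
x = [0, 1, 2, 3, 4, 5, 6]
y = [1, 3, 5, 7, 100, 11, 13]
slope, intercept = theil_sen_1d(x, y)
```

In this example only 6 of the 21 pairwise slopes involve the outlier, so the median slope is still determined by the points on the line.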

Florian

On 10 January 2014 19:18, Skipper Seabold jsseab...@gmail.com wrote:
 Hi,

 There have been some implementations of Theil-Sen floating around for
 inclusion in statsmodels, but no PRs yet. IMO it might fit in a little better
 in statsmodels.robust than sklearn unless there are some aspects of Theil-Sen
 I'm not familiar with.

 Skipper

 Sent from my mobile

 On Jan 10, 2014, at 12:16 PM, florian.wilh...@gmail.com 
 florian.wilh...@gmail.com wrote:

 Hi,

 I'd like to add a Theil-Sen estimator for a multiple linear regression
 problem to Scikit-Learn as described in the paper:
 http://home.olemiss.edu/~xdang/papers/MTSE.pdf
 Is anyone already working on this or are there any objections
 regarding the inclusion of a Theil-Sen estimator into Scikit-Learn?

 Best regards,

 Florian Wilhelm



Re: [Scikit-learn-general] Theil-Sen estimator for a multiple linear regression problem

2014-01-11 Thread Alexandre Gramfort
hi,

Did you try SVR, possibly with epsilon set to 0?

If it's too slow, have a look at lightning's new LinearSVR estimator.
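
The reasoning behind this suggestion, as a small sketch: SVR uses the epsilon-insensitive loss, which ignores residuals smaller than epsilon and grows linearly beyond that. With epsilon set to 0 it reduces to the absolute (L1) loss, which penalizes large residuals far less than a squared loss does. The function names and sample residuals below are illustrative only.

```python
def epsilon_insensitive(residual, epsilon):
    """Loss used by SVR: zero inside the tube |residual| <= epsilon, linear outside."""
    return max(0.0, abs(residual) - epsilon)


def squared(residual):
    """Squared (L2) loss, shown for comparison."""
    return residual ** 2


residuals = [0.05, -0.5, 10.0]  # the last one plays the role of an outlier

with_tube = [epsilon_insensitive(r, 0.1) for r in residuals]
no_tube = [epsilon_insensitive(r, 0.0) for r in residuals]  # same as abs(r)
squares = [squared(r) for r in residuals]
```

Note that SVR also includes a regularization term on the weights, so it is not identical to plain least-absolute-deviations regression.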

Alex




On Sat, Jan 11, 2014 at 7:28 PM, florian.wilh...@gmail.com 
florian.wilh...@gmail.com wrote:

 Hi,

 At Blue Yonder we often use Scikit-Learn, but we sometimes miss
 more robust regression methods that are not based on the L2 norm.
 So far I only knew Theil-Sen as a linear regression method with only a
 single explanatory variable. The work of Xin Dang, Hanxiang Peng,
 Xueqin Wang and Heping Zhang extends the method to n explanatory
 variables, so I think it should fit perfectly into the
 sklearn.linear_model subpackage. Where is the line drawn between
 functionality that should go into StatsModels and functionality that
 should go into Scikit-Learn with respect to regression methods?

 Florian

 On 10 January 2014 19:18, Skipper Seabold jsseab...@gmail.com wrote:
  Hi,
 
  There have been some implementations of Theil-Sen floating around for
  inclusion in statsmodels, but no PRs yet. IMO it might fit in a little
  better in statsmodels.robust than sklearn unless there are some aspects of
  Theil-Sen I'm not familiar with.
 
  Skipper
 
  Sent from my mobile
 
  On Jan 10, 2014, at 12:16 PM, florian.wilh...@gmail.com 
 florian.wilh...@gmail.com wrote:
 
  Hi,
 
  I'd like to add a Theil-Sen estimator for a multiple linear regression
  problem to Scikit-Learn as described in the paper:
  http://home.olemiss.edu/~xdang/papers/MTSE.pdf
  Is anyone already working on this or are there any objections
  regarding the inclusion of a Theil-Sen estimator into Scikit-Learn?
 
  Best regards,
 
  Florian Wilhelm
 
 