[scikit-learn] One-hot encoding

2018-02-04 Thread Sarah Wait Zaranek
Hello - I was just wondering if there was a way to improve performance on the one-hot encoder. Or, are there any plans to do so in the future? I am working with a matrix that will ultimately have 20 million categorical variables, and my bottleneck is the one-hot encoder. Let me know if this

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Joel Nothman
20 million categories, or 20 million categorical variables? OneHotEncoder is pretty efficient if you specify n_values.
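A minimal sketch of Joel's point. Note this thread predates scikit-learn 0.20: the `n_values` parameter discussed here was deprecated in 0.20 and later removed; in current releases the equivalent is `categories`, a list of per-column category arrays. Pre-declaring the categories lets the encoder skip the inference pass over the data (the data here is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical integer-coded data: 4 samples, 3 categorical columns.
X = np.array([[7, 0, 3],
              [1, 2, 0],
              [0, 2, 1],
              [1, 0, 3]])

# Declaring the levels of each column up front (modern spelling of the
# thread's n_values) avoids the 'auto' detection pass over the data.
enc = OneHotEncoder(categories=[[0, 1, 7], [0, 2], [0, 1, 3]])
T = enc.fit_transform(X)   # sparse by default -- important at this scale

print(T.shape)             # (4, 8): 3 + 2 + 3 output columns
```

The sparse default matters here: with 20 million categorical variables, a dense one-hot matrix would not fit in memory, while a CSR matrix stores only the one nonzero entry per input column.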

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Sarah Wait Zaranek
Sorry - your second message popped up when I was writing my response. I will look at this as well. Thanks for being so speedy! Cheers, Sarah

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Joel Nothman
If each input column is encoded as a value from 0 to (number of possible values for that column - 1), then n_values for that column should be the highest value + 1, which is also the number of levels per column. Does that make sense? Actually, I've realised there's a somewhat slow and
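Joel's rule can be checked in a couple of lines on a hypothetical column. The equality between "highest code + 1" and "number of levels" holds exactly when every code 0..k-1 actually occurs, which is the premise of his message:

```python
import numpy as np

# Hypothetical column encoded 0 .. k-1, with every code occurring at
# least once (Joel's premise).
col = np.array([0, 2, 2, 1, 0])

n_levels = col.max() + 1          # the thread's n_values for this column
assert n_levels == len(np.unique(col)) == 3

# In modern scikit-learn (>= 0.20, where n_values no longer exists),
# the same declaration is categories=[list(range(n_levels))].
```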

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Joel Nothman
If you specify n_values=[list_of_vals_for_column1, list_of_vals_for_column2], you should be able to engineer it to how you want.
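A sketch of the per-column-lists form Joel describes, using the modern `categories` parameter (the post-0.20 replacement for `n_values`) and hypothetical data. Listing only the values you want encoded gives you exact control over the output column layout:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: column 1 takes values {0, 1, 7}, column 2 {0, 2}.
X = [[7, 0], [1, 2], [0, 2], [1, 0]]

# One sorted list of values per input column; each listed value gets
# exactly one output column, in this order.
enc = OneHotEncoder(categories=[[0, 1, 7], [0, 2]])
T = enc.fit_transform(X).toarray()

print(T.shape)   # (4, 5): 3 + 2 output columns, none identically zero
```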

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Sarah Wait Zaranek
Hi Joel - 20 million categorical variables. It comes from segmenting the genome into 20 million parts. Genomes are big :) For n_values, I am a bit confused. Is the input the same as the output for n_values? Originally, I thought it was just the number of levels per column, but it seems like

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Joel Nothman
You will also benefit from assume_finite (see http://scikit-learn.org/stable/modules/generated/sklearn.config_context.html).
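A small sketch of what Joel is suggesting: `sklearn.config_context(assume_finite=True)` disables the NaN/inf validation pass on input arrays inside the `with` block, which saves a full scan of the data (the toy data below is hypothetical; the saving only matters at scale):

```python
import numpy as np
from sklearn import config_context
from sklearn.preprocessing import OneHotEncoder

# Hypothetical integer-coded data: 3 samples, 2 columns.
X = np.array([[0, 1], [1, 0], [2, 1]])

# Inside this block, estimators skip their finiteness checks on X.
# Only safe when you already know the data contains no NaN/inf.
with config_context(assume_finite=True):
    T = OneHotEncoder().fit_transform(X)

print(T.shape)   # (3, 5): 3 levels in column 1 + 2 in column 2
```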

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Sarah Wait Zaranek
Hi Joel - Conceptually, that makes sense. But when I assign n_values, I can't make it match the result when you don't specify them. See below. I used the number of unique levels per column.

>>> enc = OneHotEncoder(sparse=False)
>>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1,

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Sarah Wait Zaranek
If I use the n+1 approach, then I get the correct matrix, except with the columns of zeros:

>>> test
array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
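Sarah's symptom can be reproduced with the modern `categories` API (the post-0.20 replacement for `n_values`): declaring every integer 0..max per column (the "n+1 approach") creates one output column per declared level, so levels that never occur become all-zero columns; declaring only the observed values drops them. The 4-sample data below is hypothetical, in the spirit of the thread's example (its last row is cut off in the archive):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data matching the thread's first three rows; the fourth
# row is an assumption, since the original message is truncated.
X = np.array([[7, 0, 3],
              [1, 2, 0],
              [0, 2, 1],
              [1, 0, 3]])

# "n+1 approach": declare every integer 0..max for each column.
# Column 1 is coded {0, 1, 7}, so declared levels 2..6 never occur
# and yield all-zero output columns.
dense = OneHotEncoder(categories=[list(range(c.max() + 1)) for c in X.T])
T_dense = dense.fit_transform(X).toarray()
zero_cols = int((T_dense.sum(axis=0) == 0).sum())

# Declaring only the observed values per column removes those columns.
tight = OneHotEncoder(categories=[np.unique(c).tolist() for c in X.T])
T_tight = tight.fit_transform(X).toarray()

print(T_dense.shape, zero_cols)   # (4, 15) 7
print(T_tight.shape)              # (4, 8)
```

The 15-column shape matches the array in Sarah's message; the 7 all-zero columns are the declared-but-unobserved levels (2-6 in column 1, 1 in column 2, 2 in column 3).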

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Sarah Wait Zaranek
Great. Thank you for all your help. Cheers, Sarah