Re: [scikit-learn] partial_fit implementation for IsolationForest

2016-07-01 Thread donkey-hotei
Hi Olivier, thanks for your response. What you describe is quite different from what sklearn models typically do with partial_fit. partial_fit is more about out-of-core / streaming fitting rather than true online learning with explicit forgetting. In particular, what you suggest would not accep

[scikit-learn] Adding BM25 to sklearn.feature_extraction.text

2016-07-01 Thread Basil Beirouti
se matrix still takes less time than 3, and takes about as long as 2. > > > > So my question is, how important is it that my BM25Transformer outputs a > > sparse matrix? > > > > I'm going to try another implementation which looks direc

[scikit-learn] Adding BM25 to scikit-learn.feature_extraction.text

2016-07-01 Thread Basil Beirouti
Hi everyone, to put it succinctly, here's the BM25 equation: f(w,D) * (k+1) / (k*B + f(w,D)) where w is the word, and D is the document (corresponding to rows and columns, respectively). f is a sparse matrix because only a fraction of the whole vocabulary of words appears in any given single doc
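
For readers following the thread, the equation above can be typeset as below. The subscript on B and the expansion of B in terms of b, |D| and avgdl are assumptions: the excerpt does not define B, so the second formula is the usual Okapi BM25 length normalization, not necessarily the one used in this message.

```latex
\[
\operatorname{BM25}(w, D) = \frac{f(w, D)\,(k + 1)}{k\,B_D + f(w, D)},
\qquad
B_D = 1 - b + b\,\frac{|D|}{\mathrm{avgdl}} \quad \text{(assumed, standard Okapi form)}
\]
```

Whenever f(w, D) = 0 the numerator is zero, so the result has the same sparsity pattern as f as long as k B_D is nonzero.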

Re: [scikit-learn] Adding BM25 to scikit-learn.feature_extraction.text

2016-07-01 Thread Vlad Niculae
Hi Basil, If B were just a constant, you could do the whole thing as a vectorized operation on X.data. Since I understand B is an n_samples vector, I think the cleanest way to compute the denominator is using sklearn.utils.sparsefuncs.inplace_row_scale. Hope this helps, Vlad On July 1, 2016
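
A minimal sketch of the vectorized idea in this reply, under stated assumptions: X is a CSR matrix whose rows are documents (as in CountVectorizer output), B holds one value per row, and the function name bm25_tf and the default k are hypothetical. It expands B with np.repeat so the formula can be applied directly to X.data, rather than calling inplace_row_scale as suggested above.

```python
import numpy as np
import scipy.sparse as sp


def bm25_tf(X, B, k=1.2):
    """Apply f * (k + 1) / (k * B + f) to the stored entries of X.

    X : array-like or sparse matrix, shape (n_samples, n_features)
    B : array-like, shape (n_samples,), one value per row (assumed layout)
    """
    # Work on a float CSR copy so the caller's matrix is left untouched.
    X = sp.csr_matrix(X, dtype=np.float64, copy=True)
    # Drop explicitly stored zeros so only true non-zero counts are touched.
    X.eliminate_zeros()
    B = np.asarray(B, dtype=np.float64)

    # Number of stored entries in each row, from the CSR row pointer.
    nnz_per_row = np.diff(X.indptr)
    # Repeat each row's B once per stored entry, aligned with X.data.
    B_per_entry = np.repeat(B, nnz_per_row)

    f = X.data
    X.data = f * (k + 1) / (k * B_per_entry + f)
    return X
```

Only stored entries are visited, so no zero-valued cell ever enters the denominator and the output stays sparse. Dividing numerator and denominator by B gives the equivalent form (f/B) * (k+1) / (k + f/B), which is where a row scaling by 1/B would fit.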

[scikit-learn] Bm25

2016-07-01 Thread Basil Beirouti
, and see if it's possible to create a copy of > .data attribute and update the values accordingly. I was hoping > somebody had encountered this type of issue before. > > Sincerely, > > Basil Beirouti

Re: [scikit-learn] Bm25

2016-07-01 Thread Vlad Niculae
create a new copy of either a dok sparse >> matrix or a regular numpy array and assign to that. >> >> I could also deal directly with the .data, .indptr, and indices >> attributes of csr_matrix, and see if it's possible to create a copy of >> .data attribute

Re: [scikit-learn] Bm25

2016-07-01 Thread Basil Beirouti
h is bad (because of dividing by zero). >>> >>> So anyway, currently I am converting to a coo_matrix and iterating through >>> the non-zero values like this: >>> >>> cx = x.tocoo() >>> for i,j,v in itertools.izip(cx.row, cx.co

Re: [scikit-learn] Bm25

2016-07-01 Thread Vlad Niculae
a >>>> denominator, which is bad (because of dividing by zero). >>>> >>>> So anyway, currently I am converting to a coo_matrix and iterating through >>>> the non-zero values like this: >>>> >>>>
