Re: [Scikit-learn-general] GSOC - Locality sensitive Hashing

2014-02-26 Thread Maheshakya Wijewardena
The "bit sampling for Hamming distance" method is already covered by the
brute-force algorithm with the "hamming" metric in nearest neighbor search.
Hence, I think it does not need to be implemented as an LSH algorithm.


On Wed, Feb 26, 2014 at 12:46 AM, Maheshakya Wijewardena 
pmaheshak...@gmail.com wrote:

 Approximate nearest neighbor search is one of the applications of
 locality-sensitive hashing. There are five major methods:

- Bit sampling for Hamming distance
- Min-wise independent permutations
- Nilsimsa Hash
- Random projection
- Stable distributions

 The bit sampling method is fairly straightforward. A reference
 implementation of the random projection method is available in the lshash
 library (https://pypi.python.org/pypi/lshash).
 I'm looking forward to seeing comments on this from prospective mentors of
 the project.

 Thank you.
 Maheshakya.



 On Tue, Feb 25, 2014 at 8:24 AM, Maheshakya Wijewardena 
 pmaheshak...@gmail.com wrote:

 Hi,
 I have looked into this project idea. I have studied this method and I
 would like to discuss it further.
 I would also like to know who the mentors for this project are and to get
 some insight into how to begin.

 Regards,
 Maheshakya,
 --
 Undergraduate,
 Department of Computer Science and Engineering,
 Faculty of Engineering.
 University of Moratuwa,
 Sri Lanka




 --
 Undergraduate,
 Department of Computer Science and Engineering,
 Faculty of Engineering.
 University of Moratuwa,
 Sri Lanka




-- 
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka


Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more

2014-02-26 Thread Lars Buitinck
2014-02-25 7:52 GMT+01:00 Gael Varoquaux gael.varoqu...@normalesup.org:
 "Extreme learning machine: theory and applications" has 1285 citations
 and was published in 2006; a large number of citations for a fairly
 recent article. I believe scikit-learn could add such an interesting
 learning algorithm along with its variations (weighted ELMs, sequential
 ELMs, etc.)

 It does sound like a possible candidate for inclusion.

We have a PR that implements them, but in too convoluted a way. My
personal choice for implementing these would be a transformer doing a
random projection + nonlinear activation. That way, you can stack any
linear model on top (think SGDClassifier for large-scale work) and get
a basic ELM. I've toyed with this variant before (typing this from
memory):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_random_state
from sklearn.utils.extmath import safe_sparse_dot
import numpy as np

class RandomHiddenLayer(BaseEstimator, TransformerMixin):
    def __init__(self, n_components=100, random_state=None):
        self.n_components = n_components
        self.random_state = random_state

    def fit(self, X, y=None):
        random_state = check_random_state(self.random_state)
        # one random weight vector per hidden unit
        self.components_ = random_state.randn(self.n_components, X.shape[1])
        return self

    def transform(self, X):
        # nonlinear activation of the random projection
        return np.tanh(safe_sparse_dot(X, self.components_.T))

Now, make_pipeline(RandomHiddenLayer(), SGDClassifier()) is an ELM
except with regularized hinge loss instead of least squares. I guess
LDA can be used to get the real ELM.
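
For illustration, a rough usage sketch on digits using the class above
(untested as written; a held-out split just to keep the score honest):

from sklearn.cross_validation import train_test_split
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

# random hidden layer + linear model = basic ELM-style classifier
elm = make_pipeline(RandomHiddenLayer(n_components=500), SGDClassifier())
elm.fit(X_train, y_train)
print(elm.score(X_test, y_test))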

I recently implemented baseline RBF networks in pretty much the same
way: k-means + RBF kernel + linear classifier. I didn't submit a PR
because it's just a pipeline of existing components.
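
Roughly, that recipe looks like this (again a from-memory sketch, not the
actual code; the class name and defaults here are made up):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

class RBFCenters(BaseEstimator, TransformerMixin):
    def __init__(self, n_centers=50, gamma=1.0, random_state=None):
        self.n_centers = n_centers
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None):
        # learn the centers with k-means
        km = KMeans(n_clusters=self.n_centers, random_state=self.random_state)
        self.centers_ = km.fit(X).cluster_centers_
        return self

    def transform(self, X):
        # each output feature is the RBF similarity to one learned center
        return rbf_kernel(X, self.centers_, gamma=self.gamma)

# make_pipeline(RBFCenters(n_centers=100), SGDClassifier()) is then the
# baseline RBF network.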

 Chances are the Multi-layer perceptron PR would be completed before the
 summer, so it won't be included in the GSoC proposal.

 In order to avoid scope creep, I compiled the following list of
 algorithms to be proposed for GSoC 2014:

 1) Extreme Learning Machines (http://sentic.net/extreme-learning-machines.pdf)
 1a) Weighted Extreme Learning Machines
 1b) Sequential Extreme Learning Machines

Does sequential mean for sequence data?



Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more

2014-02-26 Thread Gael Varoquaux
On Wed, Feb 26, 2014 at 01:29:43PM +0100, Lars Buitinck wrote:
 I recently implemented baseline RBF networks in pretty much the same
 way: k-means + RBF kernel + linear classifier. I didn't submit a PR
 because it's just a pipeline of existing components.

All your points about transformers and pipelines are good ones. Part of
the work for 'deep learning' in scikit-learn is documentation and examples
to exhibit these patterns better.

G



Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more

2014-02-26 Thread Vlad Niculae
On Wed Feb 26 13:32:08 2014, Gael Varoquaux wrote:
 documentation and example

This was exactly my thought.  Many such (near-)equivalences are not
obvious, especially for beginners.  If Lars's hinge ELM and RBF network
work well (or provide interesting feature visualisations) on some sklearn
dataset, an example would be very awesome.

The KMeans + sparse coding transformer that was lying around in a PR might
also be expressible as a pipeline, I guess.

Vlad




[Scikit-learn-general] scikits.mixture.GMM.fit issue

2014-02-26 Thread Dmitry Svinkin
To scikit-learn-general,

I fit a bimodal 1D distribution with strongly overlapping Gaussian
components using scikits.mixture.GMM. The scikits.mixture.GMM.fit gives a
result which is inconsistent with the parameters of the input distribution.

The code below demonstrates the issue.

When the two components are well separated (for example, mu1 = -1.5 in
the code), the fit produces correct results.

I would be grateful for any information on the constraints of
scikits.mixture.GMM.fit and on the possibility of obtaining appropriate
results when the Gaussian components overlap strongly.

Sorry if this is not the appropriate mailing list for such questions.

Best regards,

Dmitry


import numpy as np
from sklearn import mixture # sklearn v0.13.1

np.random.seed(1)

g = mixture.GMM(n_components=2, covariance_type='full')

n = 1
frac2 = 0.1

mu1 = -0.5
std1 = 0.5

mu2 = 0.0
std2 = 0.2

obs = np.concatenate((np.random.normal(mu1, std1, np.int(n*(1-frac2))),
                      np.random.normal(mu2, std2, np.int(n*frac2))))
g.fit(obs)

print 'fractions: '
print  np.round(g.weights_, 2)
print 'means: '
print np.round(g.means_, 2)
print 'stds: '
print np.round(np.sqrt(g._get_covars()), 2)

#output:
#fractions:
#[ 0.48  0.52]
#means:
#[[-0.74]
# [-0.18]]
#stds:
#[[[ 0.45]]
#
# [[ 0.4 ]]]


Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more

2014-02-26 Thread Lars Buitinck
2014-02-26 13:40 GMT+01:00 Vlad Niculae zephy...@gmail.com:
 This was exactly my thought.  Many such (near-)equivalences are not
 obvious, especially for beginners.  If Lars's hinge ELM and RBF network
 work well (or provide interesting feature visualisations) on some sklearn
 dataset, an example would be very awesome.

ELM on digits works extremely well: https://gist.github.com/larsmans/2493300



Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more

2014-02-26 Thread Gael Varoquaux
On Wed, Feb 26, 2014 at 03:42:50PM +0300, Issam wrote:
 Or perhaps special pipelines to simplify such common tasks.

I'd rather avoid special pipelines. For us, that would mean that we have
an API problem with the pipeline, which needs to be identified and
solved.

G



Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more

2014-02-26 Thread Lars Buitinck
2014-02-26 13:51 GMT+01:00 Gael Varoquaux gael.varoqu...@normalesup.org:
 On Wed, Feb 26, 2014 at 03:42:50PM +0300, Issam wrote:
 Or perhaps special pipelines to simplify such common tasks.

 I'd rather avoid special pipelines. For us, that would mean that we have
 an API problem with the pipeline, which needs to be identified and
 solved.

Well, for deep learning, you'd want a generalized backprop on the
final N steps, I guess :p



Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more

2014-02-26 Thread Gael Varoquaux
On Wed, Feb 26, 2014 at 01:55:11PM +0100, Lars Buitinck wrote:
  I'd rather avoid special pipelines. For us, that would mean that we have
  an API problem with the pipeline, which needs to be identified and
  solved.

 Well, for deep learning, you'd want a generalized backprop on the
 final N steps, I guess :p

OK. Point taken!

G



Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more

2014-02-26 Thread Mathieu Blondel
+1 for an RBF network transformer (with an option to choose between k-means
and random sampling).

Mathieu




Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more

2014-02-26 Thread federico vaggi
As an aside, Lars - I'd actually love to see the recipe, if you don't mind
putting up a gist or notebook.




[Scikit-learn-general] Saving Huge Models

2014-02-26 Thread Lorenzo Isella
Dear All,
I am using RandomForest on a data set which has fewer than 20 features but
about 40 lines.
The point is that, even if I work on a subset of about 3 lines to train my
model, when I save it using pickle I get a large file on the order of
several hundred MB (see the snippet at the end of the email).
I can then later load the model by doing the following

In [8]: pkl_file = open('rf_wallmart_holidays.txt')

In [9]: clf = pickle.load(pkl_file)

In [10]: pkl_file.close()

However, I am concerned that when I use the whole dataset I will get a
model size on the order of several GB, and I wonder if I will be able to
load it via pickle as I do above.
I am just wondering if I am making any gross mistake (I have never used
pickle in the past).
Any suggestions about efficient ways to store/read models developed
with sklearn are appreciated.
Regards

Lorenzo





clf = RandomForestRegressor(n_estimators=150,
                            # compute_importances=True,
                            n_jobs=2, verbose=3)

sales=train.Weekly_Sales

my_cols = set(train.columns)

my_cols.remove('Weekly_Sales')

my_cols = list(my_cols)

clf.fit(train[my_cols], sales)



f = open('rf_wallmart_non_holidays.txt','wb')
pickle.dump(clf,f)



Re: [Scikit-learn-general] Saving Huge Models

2014-02-26 Thread Olivier Grisel
You can control the size of your random forest by adjusting the
parameters n_estimators, min_samples_split and even max_depth (read
the documentation for more details).

It's up to you to find parameter values that match your constraints in
terms of accuracy vs model size in RAM and prediction speed.
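
For example (the values below are arbitrary placeholders, to be tuned
against your own accuracy vs size trade-off):

from sklearn.ensemble import RandomForestRegressor

# fewer and shallower trees give a (much) smaller pickled model
clf = RandomForestRegressor(n_estimators=50, max_depth=20,
                            min_samples_split=10, n_jobs=2)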

To get slightly faster dumping and loading you can do:

from sklearn.externals import joblib

then save the model with:

joblib.dump(rf, filename)

Then later:

model = joblib.load(filename, mmap_mode='r')

Using the mmap_mode argument makes it possible to share memory if you
have several Python processes that need to load the same model on the
same Linux / POSIX server (e.g. several Celery offline workers, or
gunicorn + flask HTTP workers computing predictions concurrently).
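
Putting the pieces together on a toy forest (the filename is just an
example):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.externals import joblib

X, y = np.random.rand(1000, 10), np.random.rand(1000)
rf = RandomForestRegressor(n_estimators=10).fit(X, y)

joblib.dump(rf, 'rf_model.pkl')  # writes rf_model.pkl plus companion .npy files
rf_loaded = joblib.load('rf_model.pkl', mmap_mode='r')  # arrays are memory-mapped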


Also, for regression or classification with a small number of tasks, you
might want to try GradientBoostingRegressor/Classifier instead of RF: you
might get smaller models with similar predictive accuracy to the RF
models. Have a look at these slides for tricks to adjust gradient
boosting parameters:

http://orbi.ulg.ac.be/handle/2268/163521

-- 
Olivier



Re: [Scikit-learn-general] Saving Huge Models

2014-02-26 Thread Andy
On 02/26/2014 05:55 PM, Peter Prettenhofer wrote:

 please make sure to pickle with the highest protocol - otherwise 
 pickle uses a textual serialization format which is quite inefficient:

   pickle.dump(clf, f, protocol=pickle.HIGHEST_PROTOCOL)
Or simply protocol=-1. This usually makes a huge difference!
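
A quick way to see the difference on a toy forest (sketch; the exact
numbers will vary):

import pickle
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X, y = np.random.rand(1000, 10), np.random.rand(1000)
clf = RandomForestRegressor(n_estimators=20).fit(X, y)

text_size = len(pickle.dumps(clf))  # default protocol 0 (text-based)
binary_size = len(pickle.dumps(clf, protocol=pickle.HIGHEST_PROTOCOL))  # binary
print(text_size, binary_size)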



Re: [Scikit-learn-general] GSOC - Locality sensitive Hashing

2014-02-26 Thread Andy

On 02/26/2014 10:13 AM, Maheshakya Wijewardena wrote:
The "bit sampling for Hamming distance" method is already covered by the
brute-force algorithm with the "hamming" metric in nearest neighbor search.
Hence, I think it does not need to be implemented as an LSH algorithm.

I would also rather focus on non-binary representations.
There is no efficient way to work with binary data in numpy afaik -- at 
least none that is supported in sklearn.


I'm very interested in this project but unfortunately I don't have the 
time to mentor.


Cheers,
Andy





Re: [Scikit-learn-general] GSOC - Locality sensitive Hashing

2014-02-26 Thread Maheshakya Wijewardena

 I would also rather focus on non-binary representations.


Even when using the random projection method for hashing, only the sign
of the dot product is considered, so in that case too there will be a
binary representation (+1s and -1s). What is your view on this method?

Nearest neighbor search is implemented in scikit-learn in
sklearn.neighbors. In unsupervised.py the NeighborsBase class is used, and
NeighborsBase (in base.py) supports the following algorithms to perform the
search:

   - brute - a brute-force linear search
   - kd_tree - KD tree search
   - ball_tree - ball tree (binary tree) search

So we can add LSH-based search as another algorithm type in
NearestNeighbors.

In order to perform neighbor search using LSH, the hashing methods should
be implemented separately (in another file). There will be multiple hash
tables built by concatenating hash functions. Here I notice an issue: since
we generate a significantly large number of hash tables, there must be a
way to store them efficiently. Is there a way to do this in the
scikit-learn way? This part will also have to be implemented outside the
NeighborsBase class. The logic for performing the search over the computed
hash tables should go into NeighborsBase (a rough sketch of the hashing
step follows below).
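
For the random projection case, the hashing step itself is only a few
lines; a rough sketch (the function and parameter names are just
illustrative):

import numpy as np

def random_projection_hash(X, n_bits=32, random_state=None):
    # Hash each row of X to an n_bits binary code using the sign of
    # random projections (one random hyperplane per bit).
    rng = np.random.RandomState(random_state)
    planes = rng.randn(X.shape[1], n_bits)
    return (np.dot(X, planes) > 0).astype(np.uint8)

# Each code row can then serve as a key into one hash table; several such
# tables (with independent sets of hyperplanes) are queried at search time.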

This is my basic opinion on how to implement LSH-based neighbor search in
scikit-learn. Your feedback and suggestions for improvements are welcome.

Regards,
Maheshakya.


-- 
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka

[Scikit-learn-general] extra trees, oob score vs shufflesplit

2014-02-26 Thread Satrajit Ghosh
hi folks,

when using extra trees, one can compute an oob score. has anybody looked
at comparing the oob_score to performing a ShuffleSplit iteration on the
data? are these in some way equivalent, or would they converge to the same
mean?
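
concretely, i mean something along these lines (rough, untested sketch):

import numpy as np
from sklearn.cross_validation import ShuffleSplit, cross_val_score
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier

digits = load_digits()

# oob estimate (bootstrap must be enabled to get an oob score)
et = ExtraTreesClassifier(n_estimators=100, bootstrap=True, oob_score=True,
                          random_state=0)
et.fit(digits.data, digits.target)
print(et.oob_score_)

# shuffle-split estimate with the same settings
cv = ShuffleSplit(len(digits.target), n_iter=10, test_size=0.3, random_state=0)
scores = cross_val_score(ExtraTreesClassifier(n_estimators=100, bootstrap=True,
                                              random_state=0),
                         digits.data, digits.target, cv=cv)
print(np.mean(scores))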

cheers,

satra


[Scikit-learn-general] marking review status of PRs

2014-02-26 Thread Joel Nothman
We seem to have a lot of PRs waiting for review in some form or another. I
think they could do with better management.

Can we use github features to make it more apparent that a PR has received
+1 (i.e. needs another reviewer) or +2 (i.e. waiting for merge)?

At the moment, [WIP] and [MRG] are marked in the PR title to similar
effect, and we could introduce [MRG+1] and [MRG+2] (although these may only
be changed by the submitter and repo collabs). One annoyance is that
GitHub's search query tokenization means that a query like MRG+1
doesn't match correctly.

We could also use Github's Labels to make them searchable, but then it's up
to repo collabs to maintain the status.

Or maybe this is a bad idea because it makes the consensus too formal...

- Joel


Re: [Scikit-learn-general] marking review status of PRs

2014-02-26 Thread Alexandre Gramfort
Hi,

I like the  [MRG+1] and [MRG+2] idea. Let's see if it can help...

Best,
A



Re: [Scikit-learn-general] Combine functionality for text feature/image feature pipeline

2014-02-26 Thread Alexandre Gramfort
hi,

do you know:

http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html

?

it might already do what you want.
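
For instance, something along these lines (the extra transformer here is
only a toy example to show the pattern):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

class DocLengthFeatures(BaseEstimator, TransformerMixin):
    # toy extra feature: document length in characters
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array([[len(doc)] for doc in X], dtype=float)

features = FeatureUnion([
    ('counts', CountVectorizer(min_df=2)),
    ('length', DocLengthFeatures()),
])
model = Pipeline([('features', features), ('clf', SGDClassifier())])

# min_df can still be tuned through the pipeline, e.g. with
# GridSearchCV(model, {'features__counts__min_df': [1, 2, 5]})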

A


On Thu, Feb 27, 2014 at 8:33 AM, michael kneier
michael.kne...@gmail.com wrote:
 Hi all,

 I would like to add a combiner class which would work with Pipeline to
 allow users to augment the output of scikit-learn's text feature extraction
 process (or other feature extraction processes). For example, after applying
 CountVectorizer, it is sometimes desirable to augment the resulting dataset
 with additional features. Unless I am missing something, this is not easily
 done if the count vectorization is being used in a pipeline, especially if
 CountVectorizer parameters such as min_df are being optimized along with
 downstream model parameters.

 After I have written code for this class, what is the easiest way to get it 
 reviewed/incorporated into scikit?

 Thanks,
 Mike Kneier


