Re: [Scikit-learn-general] GSOC - Locality sensitive Hashing
Bit sampling for Hamming distance is already covered by the brute-force algorithm in nearest neighbor search, via the hamming metric. Hence, I think it does not need to be implemented as an LSH algorithm.

On Wed, Feb 26, 2014 at 12:46 AM, Maheshakya Wijewardena pmaheshak...@gmail.com wrote:

Approximate nearest neighbor search is one of the applications of locality sensitive hashing. There are five major methods:
- Bit sampling for Hamming distance
- Min-wise independent permutations
- Nilsimsa Hash
- Random projection
- Stable distributions

The bit sampling method is fairly straightforward. The lshash library (https://pypi.python.org/pypi/lshash) can serve as a reference for implementing the random projection method. I'm looking forward to comments on this from prospective mentors of this project.

Thank you.
Maheshakya.

On Tue, Feb 25, 2014 at 8:24 AM, Maheshakya Wijewardena pmaheshak...@gmail.com wrote:

Hi, I have looked into this project idea and studied the method, and I would like to discuss it further. I would like to know who the mentors for this project are and to get some insight on how to begin.

Regards,
Maheshakya

--
Undergraduate, Department of Computer Science and Engineering, Faculty of Engineering. University of Moratuwa, Sri Lanka

--
Flow-based real-time traffic analytics software. Cisco certified tool. Monitor traffic, SLAs, QoS, Medianet, WAAS etc. with NetFlow Analyzer Customize your own dashboards, set traffic alerts and generate reports. Network behavioral analysis security monitoring. All-in-one tool.
http://pubads.g.doubleclick.net/gampad/clk?id=126839071iu=/4140/ostg.clktrk___ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more
2014-02-25 7:52 GMT+01:00 Gael Varoquaux gael.varoqu...@normalesup.org:

Extreme learning machine: theory and applications has 1285 citations and it got published in 2006; a large number of citations for a fairly recent article. I believe scikit-learn could add such an interesting learning algorithm along with its variations (weighted ELMs, sequential ELMs, etc.)

It does sound like a possible candidate for inclusion.

We have a PR that implements them, but in too convoluted a way. My personal choice for implementing these would be a transformer doing a random projection + nonlinear activation. That way, you can stack any linear model on top (think SGDClassifier for large-scale work) and get a basic ELM. I've toyed with this variant before (typing this from memory):

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.utils import check_random_state
    from sklearn.utils.extmath import safe_sparse_dot

    class RandomHiddenLayer(BaseEstimator, TransformerMixin):
        def __init__(self, n_components=100, random_state=None):
            self.n_components = n_components
            self.random_state = random_state

        def fit(self, X, y=None):
            random_state = check_random_state(self.random_state)
            # one random weight vector per hidden unit; never trained
            self.components_ = random_state.randn(self.n_components, X.shape[1])
            return self

        def transform(self, X):
            # random projection followed by a tanh nonlinearity
            return np.tanh(safe_sparse_dot(X, self.components_.T))

Now, make_pipeline(RandomHiddenLayer(), SGDClassifier()) is an ELM except with regularized hinge loss instead of least squares. I guess LDA can be used to get the real ELM.

I recently implemented baseline RBF networks in pretty much the same way: k-means + RBF kernel + linear classifier. I didn't submit a PR because it's just a pipeline of existing components.

Chances are the Multi-layer perceptron PR would be completed before the summer, so it won't be included in the GSoC proposal. In order not to get into scope creep, I compiled the following list of algorithms to be proposed for GSoC 2014:
1) Extreme Learning Machines (http://sentic.net/extreme-learning-machines.pdf)
1a) Weighted Extreme Learning Machines
1b) Sequential Extreme Learning Machines

Does sequential mean it is for sequence data?
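The least-squares variant mentioned above ("the real ELM") can be sketched without scikit-learn at all. This is only an illustration under stated assumptions: the function names elm_fit and elm_predict are hypothetical, and the random bias term and the small ridge penalty are additions that do not appear in the RandomHiddenLayer sketch from the thread. The bias is included because tanh features without one are odd functions of the input and therefore cannot fit an even target such as the XOR-like labels used here.

```python
import numpy as np


def elm_fit(X, y, n_hidden=100, ridge=1e-3, seed=0):
    """Least-squares ELM: random tanh hidden layer, closed-form output weights."""
    rng = np.random.RandomState(seed)
    W = rng.randn(X.shape[1], n_hidden)  # random input weights, never trained
    b = rng.randn(n_hidden)              # random biases (assumption, see lead-in)
    H = np.tanh(X @ W + b)               # hidden-layer activations
    # Ridge-regularised normal equations for the output weights.
    beta = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ y)
    return W, b, beta


def elm_predict(X, W, b, beta):
    """Real-valued scores; take the sign for a two-class decision."""
    return np.tanh(X @ W + b) @ beta


# XOR-like toy problem: the label is the sign of x1 * x2.
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)

W, b, beta = elm_fit(X, y)
scores = elm_predict(X, W, b, beta)
print("training accuracy:", np.mean(np.sign(scores) == y))
```

Only the output weights are learned, which is why the fit reduces to a single linear solve. Swapping that solve for SGDClassifier on top of the same hidden features gives the hinge-loss variant described in the message.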
Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more
On Wed, Feb 26, 2014 at 01:29:43PM +0100, Lars Buitinck wrote:

I recently implemented baseline RBF networks in pretty much the same way: k-means + RBF kernel + linear classifier. I didn't submit a PR because it's just a pipeline of existing components.

All your points about transformers and pipelines are true and good points. Part of the work for 'deep learning' in scikit-learn is documentation and examples to exhibit these patterns better.

G
Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more
On Wed Feb 26 13:32:08 2014, Gael Varoquaux wrote:

documentation and example

This was exactly my thought. Many such (near-)equivalences are not obvious, especially for beginners. If Lars's hinge ELM and RBF network would work well (or provide interesting feature visualisations) on some sklearn.dataset, an example would be very awesome. The KMeans + sparse coding transformer that was lying around in a PR might also be expressible as a pipeline I guess.

Vlad
[Scikit-learn-general] scikits.mixture.GMM.fit issue
To scikit-learn-general,

I fit a bimodal 1D distribution with strongly overlapping Gaussian components using sklearn.mixture.GMM. GMM.fit gives a result that is inconsistent with the parameters of the input distribution. The code below demonstrates the issue. When the two components are well separated (for example, mu1 = -1.5 in the code), the fit produces correct results.

I would be grateful for any information on the constraints of GMM.fit and on the possibility of obtaining appropriate results when the Gaussian components overlap strongly. Sorry if this is not the appropriate mailing list for such questions.

Best regards,
Dmitry

    import numpy as np
    from sklearn import mixture  # sklearn v0.13.1

    np.random.seed(1)
    g = mixture.GMM(n_components=2, covariance_type='full')

    n = 1
    frac2 = 0.1
    mu1 = -0.5
    std1 = 0.5
    mu2 = 0.0
    std2 = 0.2

    # draw (1 - frac2) of the samples from component 1, frac2 from component 2
    obs = np.concatenate(
        (np.random.normal(mu1, std1, np.int(n*(1-frac2))),
         np.random.normal(mu2, std2, np.int(n*frac2))))
    g.fit(obs)

    print 'fractions: '
    print np.round(g.weights_, 2)
    print 'means: '
    print np.round(g.means_, 2)
    print 'stds: '
    print np.round(np.sqrt(g._get_covars()), 2)

    #output:
    #fractions:
    #[ 0.48 0.52]
    #means:
    #[[-0.74]
    # [-0.18]]
    #stds:
    #[[[ 0.45]]
    #
    # [[ 0.4 ]]]
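To experiment with the overlap question independently of any library version, a small EM loop can be written directly in numpy. The following is a minimal sketch, not scikit-learn's GMM implementation: the function names, the quartile-based initialisation, the iteration count, and the sample size n = 10000 are all assumptions made for illustration.

```python
import numpy as np


def sample_mixture(rng, n, frac2, mu1, std1, mu2, std2):
    """Draw n points: a fraction frac2 from component 2, the rest from component 1."""
    n2 = int(n * frac2)
    return np.concatenate([rng.normal(mu1, std1, n - n2),
                           rng.normal(mu2, std2, n2)])


def fit_two_gaussians(x, n_iter=500):
    """Plain EM for a two-component 1D Gaussian mixture (illustrative only)."""
    # Initialise the means at the quartiles so the components start apart.
    mu = np.percentile(x, [25, 75]).astype(float)
    sigma = np.array([x.std(), x.std()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        dens = (w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
                / (sigma * np.sqrt(2 * np.pi)))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of weights, means and standard deviations.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        sigma = np.maximum(sigma, 1e-6)  # guard against a collapsing component
    return w, mu, sigma


rng = np.random.RandomState(1)
for mu1 in (-0.5, -1.5):  # overlapping vs. well-separated setting from the post
    obs = sample_mixture(rng, 10000, 0.1, mu1, 0.5, 0.0, 0.2)
    w, mu, sigma = fit_two_gaussians(obs)
    print("mu1 =", mu1, "weights", np.round(w, 2),
          "means", np.round(mu, 2), "stds", np.round(sigma, 2))
```

Running both settings side by side makes it easy to compare the recovered parameters with the ones used to generate the data, and to try different initialisations when the components overlap.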
Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more
2014-02-26 13:40 GMT+01:00 Vlad Niculae zephy...@gmail.com:

This was exactly my thought. Many such (near-)equivalences are not obvious, especially for beginners. If Lars's hinge ELM and RBF network would work well (or provide interesting feature visualisations) on some sklearn.dataset, an example would be very awesome.

ELM on digits works extremely well: https://gist.github.com/larsmans/2493300
Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more
On Wed, Feb 26, 2014 at 03:42:50PM +0300, Issam wrote:

Or perhaps special pipelines to simplify such common tasks.

I'd rather avoid special pipelines. For me, that would mean that we have an API problem with the pipeline that needs to be identified and solved.

G
Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more
2014-02-26 13:51 GMT+01:00 Gael Varoquaux gael.varoqu...@normalesup.org:

On Wed, Feb 26, 2014 at 03:42:50PM +0300, Issam wrote:

Or perhaps special pipelines to simplify such common tasks.

I'd rather avoid special pipelines. For me, that would mean that we have an API problem with the pipeline that needs to be identified and solved.

Well, for deep learning, you'd want a generalized backprop on the final N steps, I guess :p
Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more
On Wed, Feb 26, 2014 at 01:55:11PM +0100, Lars Buitinck wrote:

I'd rather avoid special pipelines. For me, that would mean that we have an API problem with the pipeline that needs to be identified and solved.

Well, for deep learning, you'd want a generalized backprop on the final N steps, I guess :p

OK. Point taken!

G
Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more
+1 for an RBF network transformer (with an option to choose between k-means and random sampling).

Mathieu

On Wed, Feb 26, 2014 at 9:40 PM, Vlad Niculae zephy...@gmail.com wrote:

On Wed Feb 26 13:32:08 2014, Gael Varoquaux wrote:

documentation and example

This was exactly my thought. Many such (near-)equivalences are not obvious, especially for beginners. If Lars's hinge ELM and RBF network would work well (or provide interesting feature visualisations) on some sklearn.dataset, an example would be very awesome. The KMeans + sparse coding transformer that was lying around in a PR might also be expressible as a pipeline I guess.

Vlad
Re: [Scikit-learn-general] GSoC - Completing my Neural Network PRs and more
As an aside, Lars: I'd actually love to see the recipe, if you don't mind putting up a gist or notebook.
[Scikit-learn-general] Saving Huge Models
Dear All,

I am using RandomForest on a data set which has fewer than 20 features, but about 40 lines. The point is that, even if I work on a subset of about 3 lines to train my model, when I save it using pickle, I get a large file on the order of several hundred MB (see the snippet at the end of the email). I can then later load the model by doing the following:

    In [8]: pkl_file = open('rf_wallmart_holidays.txt', 'rb')
    In [9]: clf = pickle.load(pkl_file)
    In [10]: pkl_file.close()

However, I am concerned that when I use the whole dataset, I will get a model size on the order of several GB, and I wonder if I will be able to load it via pickle as I do above. I am just wondering if I am making any gross mistake (I have never used pickle in the past). Any suggestions about efficient ways to store and read models developed with sklearn are appreciated.

Regards
Lorenzo

    import pickle
    from sklearn.ensemble import RandomForestRegressor

    clf = RandomForestRegressor(n_estimators=150,
                                # compute_importances=True,
                                n_jobs=2, verbose=3)
    sales = train.Weekly_Sales
    # use every column except the target as a feature
    my_cols = set(train.columns)
    my_cols.remove('Weekly_Sales')
    my_cols = list(my_cols)
    clf.fit(train[my_cols], sales)
    f = open('rf_wallmart_non_holidays.txt', 'wb')
    pickle.dump(clf, f)
    f.close()
Re: [Scikit-learn-general] Saving Huge Models
You can control the size of your random forest by adjusting the parameters n_estimators, min_samples_split and even max_depth (read the documentation for more details). It's up to you to find parameter values that match your constraints in terms of accuracy vs model size in RAM and prediction speed.

To get slightly faster dumping and loading you can do:

    from sklearn.externals import joblib

then save the model with:

    joblib.dump(rf, filename)

Then later:

    model = joblib.load(filename, mmap_mode='r')

Using the mmap_mode argument makes it possible to share memory if you have several Python processes that need to load the same model on the same Linux / POSIX server (e.g. several Celery offline workers, or gunicorn + flask HTTP workers computing predictions concurrently).

Also, for regression or classification with a small number of tasks you might want to try GradientBoostingRegressor/Classifier instead of RF: you might get smaller models for similar predictive accuracy as the RF models. Have a look at these slides for tricks to adjust Gradient Boosting parameters: http://orbi.ulg.ac.be/handle/2268/163521

--
Olivier
Re: [Scikit-learn-general] Saving Huge Models
On 02/26/2014 05:55 PM, Peter Prettenhofer wrote:

please make sure to pickle with the highest protocol - otherwise pickle uses a textual serialization format which is quite inefficient:

    pickle.dump(clf, f, protocol=pickle.HIGHEST_PROTOCOL)

Or simply protocol=-1. This usually makes a huge difference!
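The effect of the protocol choice is easy to measure with the standard library alone. The sketch below uses a plain list of floats as a stand-in payload (an assumption for illustration, not a fitted scikit-learn model) and compares the textual protocol 0 with the highest available protocol. Note that Python 2 defaults to protocol 0, whereas Python 3 already defaults to a binary protocol, so the explicit argument matters most on Python 2.

```python
import pickle

# Stand-in for a fitted model: a large list of floats (hypothetical payload).
payload = [i / 7.0 for i in range(100000)]

text_bytes = pickle.dumps(payload, protocol=0)  # textual protocol
binary_bytes = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)  # same as -1

print("protocol 0:      ", len(text_bytes), "bytes")
print("highest protocol:", len(binary_bytes), "bytes")

# The compact binary form round-trips to an equal object.
restored = pickle.loads(binary_bytes)
```

The same keyword applies to pickle.dump when writing to a file opened in 'wb' mode.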
Re: [Scikit-learn-general] GSOC - Locality sensitive Hashing
On 02/26/2014 10:13 AM, Maheshakya Wijewardena wrote:

The method Bit sampling for Hamming distance is already included in brute algorithm as the metric hamming in Nearest neighbor search. Hence, I think that does not need to be implemented as a LSH algorithm

I would also rather focus on non-binary representations. There is no efficient way to work with binary data in numpy afaik -- at least none that is supported in sklearn.

I'm very interested in this project but unfortunately I don't have the time to mentor.

Cheers,
Andy
Re: [Scikit-learn-general] GSOC - Locality sensitive Hashing
I would also rather focus on non-binary representations.

Even when using the random projection method for hashing, only the sign of the dot product is considered. So in that situation, too, there will be a binary representation (or +1s and -1s). What is your opinion of this method?

Nearest neighbor search is implemented in scikit-learn in sklearn.neighbors. In unsupervised.py, the NeighborsBase class is used, and NeighborsBase (in base.py) supports the following algorithms for performing the search:
- brute - a brute-force linear search
- kd_tree - KD tree search
- ball_tree - ball tree search

So we can add LSH-based search as another algorithm type in NearestNeighbors. To perform neighbor search using LSH, the hashing methods should be implemented separately (in another file). Multiple hash tables will be built by concatenating hash functions. Here I notice an issue: since we generate a significantly large number of hash tables, there must be a way to store them efficiently. Is there a way to do this in the scikit-learn way? This part will also have to be implemented outside the NeighborsBase class. The logic for performing the search using the computed hash tables should be included in NeighborsBase.

This is my basic opinion on how to implement LSH-based neighbor search in scikit-learn. Your feedback and suggestions for improvements are welcome.

Regards,
Maheshakya.
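The scheme described in this message (sign of random projections as hash bits, several bits concatenated into one key, several independent tables) can be sketched in a few lines of numpy. This is an illustration only, not a proposed scikit-learn API: the class name RandomProjectionLSH, its parameters, and the use of Python dictionaries as hash tables are all assumptions.

```python
from collections import defaultdict

import numpy as np


class RandomProjectionLSH:
    """Illustrative index: n_tables hash tables, each keyed by n_bits sign bits."""

    def __init__(self, n_tables=8, n_bits=12, random_state=0):
        self.n_tables = n_tables
        self.n_bits = n_bits
        self.random_state = random_state

    def _keys(self, X):
        # The sign of each projection is one bit; n_bits of them are
        # concatenated into one bucket key per table and per sample.
        bits = np.einsum("tbd,nd->tnb", self.planes_, X) > 0
        return [[row.tobytes() for row in table_bits] for table_bits in bits]

    def fit(self, X):
        rng = np.random.RandomState(self.random_state)
        self.X_ = np.asarray(X, dtype=float)
        self.planes_ = rng.randn(self.n_tables, self.n_bits, self.X_.shape[1])
        self.tables_ = [defaultdict(list) for _ in range(self.n_tables)]
        for table, keys in zip(self.tables_, self._keys(self.X_)):
            for i, key in enumerate(keys):
                table[key].append(i)
        return self

    def kneighbors(self, x, n_neighbors=5):
        x = np.asarray(x, dtype=float).reshape(1, -1)
        # Candidate set: union of the query's bucket in every table.
        candidates = set()
        for table, keys in zip(self.tables_, self._keys(x)):
            candidates.update(table.get(keys[0], []))
        candidates = np.array(sorted(candidates), dtype=int)
        if candidates.size == 0:
            return candidates
        # Rank only the candidates by exact Euclidean distance.
        dist = np.linalg.norm(self.X_[candidates] - x, axis=1)
        return candidates[np.argsort(dist)[:n_neighbors]]


rng = np.random.RandomState(42)
X = rng.randn(1000, 20)
index = RandomProjectionLSH().fit(X)
neighbors = index.kneighbors(X[3], n_neighbors=3)
print(neighbors)
```

More bits per key make buckets smaller and queries faster but lower the chance that true neighbors collide, while more tables raise recall at the cost of memory, which is the storage concern raised in the message.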
[Scikit-learn-general] extra trees, oob score vs shufflesplit
hi folks,

when using extra trees, one can compute an oob score. has anybody looked at comparing the oob_score to performing a shufflesplit iteration on the data? are these in some ways equivalent, or would they converge to the same mean?

cheers,
satra
[Scikit-learn-general] marking review status of PRs
We seem to have a lot of PRs waiting for review in some form or another. I think they could do with better management. Can we use github features to make it more apparent that a PR has received +1 (i.e. needs another reviewer) or +2 (i.e. waiting for merge)?

At the moment, [WIP] and [MRG] are marked in the PR title to similar effect, and we could introduce [MRG+1] and [MRG+2] (although these may only be changed by the submitter and repo collabs). One annoyance is that github's search query tokenization means that a query like MRG+1 or MRG+1 doesn't match correctly. We could also use Github's Labels to make them searchable, but then it's up to repo collabs to maintain the status.

Or maybe this is a bad idea because it makes the consensus too formal...

- Joel
Re: [Scikit-learn-general] marking review status of PRs
Hi,

I like the [MRG+1] and [MRG+2] idea. Let's see if it can help...

Best,
A
Re: [Scikit-learn-general] Combine functionality for text feature/image feature pipeline
hi,

do you know http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html ? it might already do what you want.

A

On Thu, Feb 27, 2014 at 8:33 AM, michael kneier michael.kne...@gmail.com wrote:

Hi all,

I would like to add a combiner class which would work with pipeline to allow users to augment the output of scikit's text feature extraction process (or other feature extraction processes). For example, after applying CountVectorizer, it is sometimes desirable to augment the resulting dataset with additional features. Unless I am missing something, this is not easily done if the count vectorization is being used in a pipeline, especially if CountVectorizer parameters such as min_df are being optimized along with downstream model parameters.

After I have written code for this class, what is the easiest way to get it reviewed/incorporated into scikit?

Thanks,
Mike Kneier
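As a rough illustration of how FeatureUnion covers the use case in the question, the sketch below concatenates CountVectorizer output with one extra hand-made feature. The TextLength transformer, the step names "counts" and "length", and the toy documents are hypothetical and chosen only for the example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion


class TextLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of characters in each document."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(doc)] for doc in X], dtype=float)


docs = ["the cat sat", "the dog barked loudly", "cats and dogs"]

features = FeatureUnion([
    ("counts", CountVectorizer()),  # bag-of-words columns
    ("length", TextLength()),       # one additional column
])
X = features.fit_transform(docs)

# Vocabulary size of the fitted CountVectorizer step.
n_terms = len(dict(features.transformer_list)["counts"].vocabulary_)
print(X.shape, n_terms)
```

Because the union is itself an estimator, parameters of its parts stay tunable: inside a Pipeline whose union step is named, say, "features", CountVectorizer's min_df is addressed as features__counts__min_df in a grid search.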