Re: [Scikit-learn-general] Latent Dirichlet Allocation

2016-01-26 Thread Rockenkamm, Christian
Subject: Re: [Scikit-learn-general] Latent Dirichlet Allocation How many distinct words are in your dataset? On 27 January 2016 at 00:21, Rockenkamm, Christian <mailto:[email protected]> wrote: Hello, I have a question concerning the Latent Dirichlet Allocation

[Scikit-learn-general] Latent Dirichlet Allocation

2016-01-26 Thread Rockenkamm, Christian
affected, depending on the parameter setting. Does anybody have an idea as to what might be causing this problem and how to resolve it? Best regards, Christian Rockenkamm

[Scikit-learn-general] Latent Dirichlet Allocation topic-word-matrix and the document-topic-matrix

2015-12-08 Thread Rockenkamm, Christian
Hello, I have a short question concerning the Latent Dirichlet Allocation in scikit. Is it possible to acquire the topic-word-matrix and the document-topic-matrix? If so, could someone please explain to me how to do that? Best regards, Christian Rockenkamm
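
A minimal sketch of one way to get both matrices with sklearn.decomposition.LatentDirichletAllocation (the topic-count parameter is n_topics in older releases, n_components later; the documents below are made up):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat", "dogs and cats are pets", "markets fell sharply today"]
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

topic_word = lda.components_                                      # (n_topics, n_words) pseudo-counts
topic_word = topic_word / topic_word.sum(axis=1, keepdims=True)   # rows as word distributions
doc_topic = lda.transform(X)                                      # (n_docs, n_topics)
doc_topic = doc_topic / doc_topic.sum(axis=1, keepdims=True)      # rows as topic distributions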

[Scikit-learn-general] multiclass classification ~ 1500 categories

2014-07-30 Thread Christian Schulz
least for prototyping: (1) No need to organize this huge number of models in a database (serialization). (2) Comparability between the scores. Disadvantage: (1) Difficult to adjust/weight the outcome. Many thanks, Christian

Re: [Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15

2014-07-21 Thread Christian Jauvin
> What I had in mind (for the LB) was an option to "reserve" an extra > column at the LB creation, which could then be used to map all the > unknown values further encountered by "transform". This column would > obviously be all zeros in the matrix returned by "fit_transform" (i.e. > could only con

Re: [Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15

2014-07-17 Thread Christian Jauvin
> I think the encoders should all be able to deal with unknown labels. > The thing about the extra single value is that you don't have a column > to map it to. > How would you use the extra value in LabelBinarizer or OneHotEncoder? You're right, and this points to a difference between what PR #324

Re: [Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15

2014-07-16 Thread Christian Jauvin
an issue on github? > > I am not sure that it would make sense to add an unknown-column > label with an optional parameter. But you could easily add one with > some numpy operations: > > np.hstack([y, y.sum(axis=1, keepdims=True) == 0]) > > Best regards, > Arnaud
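
A self-contained illustration of that one-liner (the label-indicator matrix y below is made up): an all-zero row means the sample matched none of the known classes, and the appended column flags exactly those rows.

import numpy as np

y = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 0, 0]])   # third row: an unseen label, binarized to all zeros
y_ext = np.hstack([y, y.sum(axis=1, keepdims=True) == 0])
# y_ext[:, -1] is True only for the "unknown" rows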

[Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15

2014-07-16 Thread Christian Jauvin
e.net/p/scikit-learn/mailman/message/31827616/ So if my understanding of this mechanism is correct (as well as my assumptions about the way it is/should be used), would it make sense to add something like a "map_unknowns_to_single_class" extra parameter to all the preprocessing encod

Re: [Scikit-learn-general] Difference between sklearn.feature_selection.chi2 and scipy.stats.chi2_contingency

2014-06-30 Thread Christian Jauvin
> If I understand you correctly, one way to reconcile the difference > between the two interpretations (multinomial vs binomial) would be to > first binarize my boolean input variable: Just for the sake of clarity: I meant to add the complement of my input variable (i.e. as a second feature), rath

Re: [Scikit-learn-general] Difference between sklearn.feature_selection.chi2 and scipy.stats.chi2_contingency

2014-06-30 Thread Christian Jauvin
enever one tries to use it the way I did (i.e. assuming a binomial event model), he would silently obtain wrong results? Isn't there a use for the binomial case? Thanks, Christian

[Scikit-learn-general] Difference between sklearn.feature_selection.chi2 and scipy.stats.chi2_contingency

2014-06-29 Thread Christian Jauvin
array([[ 15., 10.], [ 45., 30.]])) What explains the difference in terms of the Chi-Square value (0.5 vs 2) and the P-value (0.48 vs 0.157)? Thanks, Christian
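
A hedged sketch of the two calls side by side (toy data, not the original numbers): sklearn's chi2 treats each feature column as event counts, i.e. a multinomial view of only the "on" events, while chi2_contingency works on the full 2x2 table that also includes the complement, and applies Yates' continuity correction by default.

import numpy as np
from sklearn.feature_selection import chi2
from scipy.stats import chi2_contingency

x = np.array([1, 1, 0, 0, 1, 0, 0, 0])   # toy boolean feature
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])   # toy binary target

sk_chi2, sk_p = chi2(x.reshape(-1, 1), y)          # multinomial view of the "1" events only

table = np.array([[np.sum((x == 1) & (y == 1)), np.sum((x == 1) & (y == 0))],
                  [np.sum((x == 0) & (y == 1)), np.sum((x == 0) & (y == 0))]])
sp_chi2, sp_p, dof, expected = chi2_contingency(table)   # full table, Yates-corrected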

Re: [Scikit-learn-general] Similarity in a vector space model with skewed feature distribution

2014-04-23 Thread Christian Jauvin
y measure, or dealing with > large quantities of sparse data in a memory-efficient way? If it is the > latter, you can look into feature hashing: > http://en.wikipedia.org/wiki/Feature_hashing > > regards > shankar. > On Wed, Apr 23, 2014 at 9:59 AM, Ch

[Scikit-learn-general] Similarity in a vector space model with skewed feature distribution

2014-04-22 Thread Christian Jauvin
the very skewed distribution. I'd greatly appreciate any idea or suggestion about this problem. Thanks, Christian

Re: [Scikit-learn-general] LabelEncoder with never seen before values

2014-01-11 Thread Christian Jauvin
(which is what I assume, because I guess it can be considered a form of data leakage), what is the standard way to solve the issue of test values (for a categorical variable) that have never been encountered in the training set? On 9 January 2014 15:21, Christian Jauvin wrote:

[Scikit-learn-general] LabelEncoder with never seen before values

2014-01-09 Thread Christian Jauvin
Hi, If a LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set. The only solution I could come up with for this is to map everything new in the test set (i.e. not belonging to any existing class) to "", and then explicitly add a corresp
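
One possible workaround along these lines, as a sketch (the placeholder string and the values are made up): fit the encoder with an explicit extra class reserved for unknowns, then map anything unseen to it before calling transform.

from sklearn.preprocessing import LabelEncoder

UNKNOWN = "<unknown>"                     # hypothetical placeholder class
train = ["paris", "tokyo", "paris", "oslo"]
test = ["tokyo", "berlin"]                # "berlin" never appeared during fit

le = LabelEncoder().fit(train + [UNKNOWN])
known = set(le.classes_)
codes = le.transform([v if v in known else UNKNOWN for v in test])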

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-06 Thread Christian Jauvin
>> I believe more in my results than in my expertise - and so should you :-) > > +1! There are very, very few examples of theory trumping data in history... And > a bajillion of the converse. I guess I didn't express myself clearly: I didn't mean to say that I mistrust my results per se... I'm not tha

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Christian Jauvin
Many thanks to all for your help and detailed answers, I really appreciate it. So I wanted to test the discussion's takeaway, namely, what Peter suggested: one-hot encode the categorical features with small cardinality, and leave the others in their ordinal form. So from the same dataset I mentio
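
A sketch of that mixed encoding with pandas (column names and data are made up): keep the year and the high-cardinality column as ordinal integers, and one-hot encode only the small-cardinality categoricals.

import pandas as pd

df = pd.DataFrame({
    "year": [2001, 2005, 2010, 2010],               # already ordinal
    "color": ["red", "blue", "red", "green"],       # small cardinality -> one-hot
    "word": ["apple", "pear", "plum", "apple"],     # large cardinality -> ordinal codes
})
df["word"] = df["word"].astype("category").cat.codes
X = pd.get_dummies(df, columns=["color"])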

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Christian Jauvin
's really what I observe: apart from the first of my 4 variables, which is a year, the remaining 3 are purely categorical, with no implicit order. So that result is weird because it is not in line with what you've been saying. Anyway, thanks for your time and patience, Christian

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Christian Jauvin
x (i.e. 4 categorical variables, non-one-hot encoded) performs the same (to the third decimal in accuracy and AUC, with 10-fold CV) as with its equivalent, one-hot encoded (21080 x 1347) matrix. Sorry if the confusion is on my side, but d

[Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-01 Thread Christian Jauvin
does it make sense? Am I "diluting" the power of the RF by doing so, and should I rather try to combine two classifiers specializing on both types of features?" http://stats.stackexchange.com/questions/60162/random-forest-with-a-mix-of-categorical-

Re: [Scikit-learn-general] Easy way to handle .arff files in sklearn?

2013-03-05 Thread Christian
For me it works fine. Cheers, Christian
> test.arff
@relation 'test'
@attribute v1 {blonde,blue}
@attribute v2 numeric
@attribute v3 numeric
@attribute class {yes,no}
@data
blonde,17.2,1,yes
blue,27.2,2,yes
blue,18.2,3,no
< end test.arff
barray [['blonde', 17.2, 1.

Re: [Scikit-learn-general] Easy way to handle .arff files in sklearn?

2013-03-05 Thread Christian
Hi Tom, recently I saw the arff package on PyPI. It seems to work:
import arff
import numpy as np
barray = []
for row in arff.load('/home/chris/tools/weka-3-7-6/rd54_train.arff'):
    barray.append(list(row))
nparray = np.array(barray)
print nparray.shape
(4940, 56)
HTH, Christian
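
Another option, assuming the file only uses numeric and nominal attributes, is SciPy's built-in reader (the path below is made up):

import numpy as np
from scipy.io import arff

data, meta = arff.loadarff('/path/to/train.arff')    # structured array + attribute metadata
X = np.array(data.tolist(), dtype=object)            # flatten to a plain 2-D array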

[Scikit-learn-general] feature selection & scoring

2013-02-22 Thread Christian
Hi, when I train a classification model on feature-selected data, I'll need both the selector object and the model object for future scoring. So I must persist both (i.e. with pickle), right? Many thanks
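
Yes, both are needed at scoring time. One way to keep them together, as a sketch (the estimator choices and file name are arbitrary), is to wrap selector and classifier in a Pipeline so only a single object has to be pickled:

import pickle
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_features=30, random_state=0)
model = Pipeline([("select", SelectKBest(f_classif, k=10)),
                  ("clf", LogisticRegression())]).fit(X, y)

with open("model.pkl", "wb") as f:        # one artifact to persist instead of two
    pickle.dump(model, f)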

[Scikit-learn-general] Using a cluster algorithm ex-post as predictor

2012-11-12 Thread Christian
Hi, after fitting a clusterer I want to label new data. Is there an easier way than building an ex-post classifier? Many thanks, Christian. Example in Weka: # Build the clusterer and save the object in cluster.cla java -cp weka.jar weka.clusterers.EM -t data0.arff -d cluste
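
For scikit-learn clusterers that implement predict (e.g. KMeans), no ex-post classifier is needed; a minimal sketch with made-up data:

import pickle
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).randn(100, 3)
km = KMeans(n_clusters=4, random_state=0).fit(X)

with open("cluster.pkl", "wb") as f:       # roughly what "-d cluster.cla" does in Weka
    pickle.dump(km, f)

labels_new = km.predict(np.random.RandomState(1).randn(5, 3))   # label new data directly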

Re: [Scikit-learn-general] "reverse feature engineering" (or something vague like that)

2012-10-02 Thread Christian Jauvin
e, it is a bit heavy on the math side). What do you think? [0] http://jmlr.csail.mit.edu/papers/volume11/baehrens10a/baehrens10a.pdf On 2 October 2012 14:34, Christian Jauvin wrote: >> * "Advice for applying Machine Learning" [1] gives general recommendations >> on ho

Re: [Scikit-learn-general] Shift-invariant Sparse Coding in scikit-learn?

2012-10-02 Thread Christian Vollmer
ikit-learn seems pretty much optimized. Or is it? On 28.09.2012 14:29, Andreas Mueller wrote: > Hi Christian. > Are you thinking about 1d or 2d convolutions? > I am not so familiar with 1d signal processing but there has > been some work on convolutional sparse coding for image

[Scikit-learn-general] "reverse feature engineering" (or something vague like that)

2012-10-01 Thread Christian Jauvin
t like "reverse engineering the features". So my question: is there a mechanism or maybe an already existing framework or theory for doing this? And would something approaching it be possible currently with Sklearn? Thanks, Christian -

[Scikit-learn-general] Shift-invariant Sparse Coding in scikit-learn?

2012-09-28 Thread Christian Vollmer
building a dictionary of all shifted versions of all atoms and then applying the implemented sparse coding algorithms. However, I don't see a shift-invariant way for the dictionary learning part. Thanks, Christian
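
A rough sketch of that workaround for the coding step only (sizes and data are made up): build a dictionary containing every circular shift of every atom and hand it to SparseCoder; the dictionary-learning step itself remains open, as noted above.

import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.RandomState(0)
n_atoms, atom_len = 3, 16
atoms = rng.randn(n_atoms, atom_len)

shifted = np.vstack([np.roll(a, s) for a in atoms for s in range(atom_len)])
shifted /= np.linalg.norm(shifted, axis=1, keepdims=True)    # unit-norm atoms

coder = SparseCoder(dictionary=shifted, transform_algorithm='omp',
                    transform_n_nonzero_coefs=2)
codes = coder.transform(rng.randn(5, atom_len))              # sparse codes over all shifts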

Re: [Scikit-learn-general] threading error when training a RFC on a big dataset

2012-09-24 Thread Christian Jauvin
happening anymore. But I'd be curious to know if there is any mechanism I could use to allow a Random Forest classifier to work with bigger datasets (than what simply fits in memory)? Thanks! On 22 September 2012 16:18, Olivier Grisel wrote: > 2012/9/22 Christian Jauvin : >> Hi,

[Scikit-learn-general] threading error when training a RFC on a big dataset

2012-09-22 Thread Christian Jauvin
7/multiprocessing/pool.py", line 319, in _handle_tasks put(task) SystemError: NULL result without error in PyObject_Call I can provide additional details of course, but first maybe there is something in particular I should be aware of, about size or memory limit of the underlying objects in

Re: [Scikit-learn-general] link between a classifier's score and fit methods

2012-09-21 Thread Christian Jauvin
i.e. the outcome of predict). Is there a workaround for that, or is that a case where subclassing is needed, as I had concluded before? Christian

Re: [Scikit-learn-general] link between a classifier's score and fit methods

2012-09-21 Thread Christian Jauvin
Hi Andreas, You mean that I could use cross_val_score's score_func argument? I tried it once, and it didn't work for some reason, so I stuck with the inheritance solution, which is really a 3-line modification anyway. Is there another way? Best, Christian On 21 September

Re: [Scikit-learn-general] link between a classifier's score and fit methods

2012-09-21 Thread Christian Jauvin
Hi Gilles, > Are you sure the RF classifier is the same in both case? (have you set > the random state to the same value?) You're right, I forgot about that! I just tested it, and both classifiers indeed produce identical predictions with the same random_state value. Thanks,

[Scikit-learn-general] link between a classifier's score and fit methods

2012-09-21 Thread Christian Jauvin
I have a classifier which derives from RandomForestClassifier, in order to implement a custom "score" method. This obviously affects scoring results obtained with cross-validation, but I observed that it seems to also affect the actual predictions. In other words, the same RF classifier with two di
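
A minimal sketch of that setup (the metric choice is arbitrary), with random_state fixed, which turned out to be the reason the predictions differed:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

class MyRF(RandomForestClassifier):
    # only score() is overridden; fit/predict are inherited unchanged
    def score(self, X, y):
        return f1_score(y, self.predict(X))

X, y = make_classification(random_state=0)
a = RandomForestClassifier(random_state=0).fit(X, y)
b = MyRF(random_state=0).fit(X, y)
assert (a.predict(X) == b.predict(X)).all()   # same seed -> identical predictions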

[Scikit-learn-general] Two problems with SGDClassifier.predict_proba()

2012-09-12 Thread Christian Jauvin
(1) When I try to use it with a sparse matrix I get (for a binary problem):
--> 585 proba = np.ones((len(X), 2), dtype=np.float64)
--> 175 raise TypeError("sparse matrix length is ambiguous; use getnnz()"
    176                 " or shape[0]")
(2) When I try to use it fo

Re: [Scikit-learn-general] computing the sample weights

2012-09-12 Thread Christian Jauvin
Thanks, that's very helpful! On 12 September 2012 11:47, Peter Prettenhofer wrote: > 2012/9/12 Peter Prettenhofer : >> [..] >> >> AFAIK Fabian has some scikit-learn code for that as well. > > here is the code https://gist.github.com/2071994 > > -- > Peter Prettenhofer

Re: [Scikit-learn-general] computing the sample weights

2012-09-12 Thread Christian Jauvin
> May I ask why you think you need this? It was my naive assumption of how to tackle class imbalance with an SGD classifier, but as Olivier already suggested, using class_weight makes more sense for this. Is there another mechanism or strategy that I should be aware of, do you think?

[Scikit-learn-general] computing the sample weights

2012-09-12 Thread Christian Jauvin
repeat(p, len(y))
for i, v in enumerate(y):
    w[i] /= bc[v]
assert np.sum(w) == 1
return w
Christian
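
A runnable version of the helper sketched above (the function name is made up, the variable names follow the snippet; class labels are assumed to be 0..k-1, and the exact-equality assert is relaxed to a float tolerance):

import numpy as np

def balanced_sample_weights(y):
    y = np.asarray(y)
    bc = np.bincount(y).astype(float)     # per-class counts
    p = 1.0 / len(bc)                     # equal total weight per class
    w = np.repeat(p, len(y))
    for i, v in enumerate(y):
        w[i] /= bc[v]                     # inverse-frequency weighting
    assert np.isclose(np.sum(w), 1.0)
    return w

weights = balanced_sample_weights([0, 0, 0, 1, 1, 2])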

[Scikit-learn-general] Memory explosion with GridSearchCV

2012-09-10 Thread Christian Jauvin
# ~303MB
y = np.asarray(x)
print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.
# ~875MB
It doesn't make sense that np.asarray should almost triple the memory consumption, does it? (With my real data, it's way worse, but I cannot seem to replicate it with a simulat