Subject: Re: [Scikit-learn-general] Latent Dirichlet Allocation
How many distinct words are in your dataset?
On 27 January 2016 at 00:21, Rockenkamm, Christian
<[email protected]> wrote:
Hello,
I have a question concerning the Latent Dirichlet Allocation [...]
affected, depending on the parameter setting.
Does anybody have an idea as to what might be causing this problem and how to
resolve it?
Best regards,
Christian Rockenkamm
Hello,
I have a short question concerning the Latent Dirichlet Allocation in scikit-learn.
Is it possible to acquire the topic-word-matrix and the document-topic-matrix?
If so, could someone please explain to me how to do that?
Best regards,
Christian Rockenkamm
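[For reference, a minimal sketch of how both matrices can be obtained with
scikit-learn's LatentDirichletAllocation; the toy documents are made up, and
the n_components parameter was called n_topics in older releases:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["apples and oranges", "oranges and bananas", "cats and dogs"]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

topic_word = lda.components_  # (n_topics, n_words); normalize rows for probabilities
doc_topic = lda.transform(X)  # (n_docs, n_topics); rows sum to 1
]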
least
for prototyping:
(1) No need to organize a huge number of models in a database
(serialization)
(2) Comparability between the scores
Disadvantage:
(1) Difficult to adjust/weight the outcome
Many thanks
Christian
> What I had in mind (for the LB) was an option to "reserve" an extra
> column at the LB creation, which could then be used to map all the
> unknown values further encountered by "transform". This column would
> obviously be all zeros in the matrix returned by "fit_transform" (i.e.
> could only con
> I think the encoders should all be able to deal with unknown labels.
> The thing about the extra single value is that you don't have a column
> to map it to.
> How would you use the extra value in LabelBinarizer or OneHotEncoder?
You're right, and this points to a difference between what PR #324 [...]
an issue on GitHub?
>
> I am not sure that it would make sense to add an unknown-values
> column via an optional parameter. But you could easily add one with
> some numpy operations:
>
> np.hstack([y, y.sum(axis=1,keepdims=True) == 0])
>
> Best regards,
> Arnaud
e.net/p/scikit-learn/mailman/message/31827616/
So if my understanding of this mechanism is correct (as well as my
assumptions about the way it is/should be used), would it make sense
to add something like a "map_unknowns_to_single_class" extra parameter
to all the preprocessing encoders?
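[Putting Arnaud's trick together with LabelBinarizer, a minimal sketch,
assuming a version where transform maps unseen labels to all-zero rows
rather than raising (current scikit-learn does):

import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
Y_train = lb.fit_transform(["a", "b", "c", "a"])

Y_test = lb.transform(["b", "d"])  # "d" was never seen: all-zero row
# Append an explicit "unknown" column for rows that matched no class
Y_ext = np.hstack([Y_test, (Y_test.sum(axis=1, keepdims=True) == 0).astype(int)])
print(Y_ext)  # [[0 1 0 0], [0 0 0 1]]
]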
> If I understand you correctly, one way to reconcile the difference
> between the two interpretations (multinomial vs binomial) would be to
> binarize first my boolean input variable:
Just for the sake of clarity: I meant to add the complement to my
input variable (i.e. as a second feature), rather [...] whenever one
tries to use it the way I did (i.e. assuming a binomial event model),
one would silently obtain wrong results? Isn't there a use for the
binomial case?
Thanks,
Christian
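[A minimal sketch of the binarization being described, i.e. feeding the
complement as a second feature so the multinomial event model also sees
absences; the data and names are purely illustrative:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
x = rng.randint(2, size=(100, 1))  # a single boolean feature
y = rng.randint(2, size=100)

X2 = np.hstack([x, 1 - x])         # the feature and its complement
clf = MultinomialNB().fit(X2, y)
]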
array([[ 15., 10.],
[ 45., 30.]]))
What explains the difference in terms of the Chi-Square value (0.5 vs 2)
and the P-value (0.48 vs 0.157)?
Thanks,
Christian
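[One frequent source of such discrepancies is Yates' continuity correction,
which scipy.stats.chi2_contingency applies by default to 2x2 tables. A sketch
with a hypothetical observed table, chosen only so that its margins reproduce
the expected frequencies shown above:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[18, 7],
                     [42, 33]])    # hypothetical counts

chi2_c, p_c, dof, expected = chi2_contingency(observed)           # with correction
chi2_u, p_u, _, _ = chi2_contingency(observed, correction=False)  # without
print(expected)     # [[15. 10.] [45. 30.]]
print(chi2_u, p_u)  # 2.0, ~0.157
print(chi2_c, p_c)  # smaller statistic, larger p-value
]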
> [...]y measure, or dealing with
> large quantities of sparse data in a memory-efficient way? If it is the
> latter, you can look into feature hashing:
> http://en.wikipedia.org/wiki/Feature_hashing
>
> regards
> shankar.
> On Wed, Apr 23, 2014 at 9:59 AM, Ch
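[A minimal sketch of the feature-hashing approach Shankar mentions, using
scikit-learn's FeatureHasher; the feature strings here are made up:

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=2**18, input_type="string")
raw = [["color=red", "city=montreal"],
       ["color=blue", "city=quebec"]]
X = hasher.transform(raw)  # sparse CSR matrix; no vocabulary kept in memory
print(X.shape)             # (2, 262144)
]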
the very
skewed distribution.
I'd greatly appreciate any idea or suggestion about this problem.
Thanks,
Christian
(which is what I assume, because it could, I guess, be
considered a form of data leakage), what is the standard
way to solve the issue of test values (for a categorical variable)
that have never been encountered in the training set?
On 9 January 2014 15:21, Christian Jauvin wrote:
> Hi,
Hi,
If a LabelEncoder has been fitted on a training set, it might break if it
encounters new values when used on a test set.
The only solution I could come up with for this is to map everything new in
the test set (i.e. not belonging to any existing class) to "<unknown>", and
then explicitly add a corresponding class to the encoder.
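[A minimal sketch of that workaround; the "<unknown>" sentinel is just a
convention:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
train = ["paris", "tokyo", "amsterdam"]
le.fit(train + ["<unknown>"])  # reserve an explicit class up front

test = ["tokyo", "berlin"]     # "berlin" never seen in training
known = set(le.classes_)
codes = le.transform([v if v in known else "<unknown>" for v in test])
print(codes)
]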
>> I believe more in my results than in my expertise - and so should you :-)
>
> +1! There are very, very few examples of theory trumping data in history... And
> a bajillion of the converse.
I guess I didn't express myself clearly: I didn't mean to say that I
mistrust my results per se... I'm not tha
Many thanks to all for your help and detailed answers, I really appreciate it.
So I wanted to test the discussion's takeaway, namely, what Peter
suggested: one-hot encode the categorical features with small
cardinality, and leave the others in their ordinal form.
So from the same dataset I mentio
that's really what I observe: apart from the first of my 4
variables, which is a year, the remaining 3 are purely categorical,
with no implicit order. So that result is weird because it is not in
line with what you've been saying.
Anyway, thanks for your time and patience,
Christian
---
matrix (i.e. 4 categorical
variables, non-one-hot encoded) performs the same (to the third
decimal in accuracy and AUC, with 10-fold CV) as with its equivalent,
one-hot encoded (21080 x 1347) matrix.
Sorry if the confusion is on my side, but d
does it make sense? Am I
"diluting" the power of the RF by doing so, and should I rather try to
combine two classifiers specializing on both types of features?"
http://stats.stackexchange.com/questions/60162/random-forest-with-a-mix-of-categorical-
For me it works fine.
Cheers, Christian
> test.arff
@relation 'test'
@attribute v1 {blonde,blue}
@attribute v2 numeric
@attribute v3 numeric
@attribute class {yes,no}
@data
blonde,17.2,1,yes
blue,27.2,2,yes
blue,18.2,3,no
< end test.arff
barray
[['blonde', 17.2, 1.
Hi Tom,
recently I saw the arff-package in pypi. Seems working.
import arff
import numpy as np
barray = []
for row in arff.load('/home/chris/tools/weka-3-7-6/rd54_train.arff'):
    barray.append(list(row))
nparray = np.array(barray)
print nparray.shape
(4940, 56)
HTH
Christian
Hi,
when I train a classification model on feature-selected data, I'll
need both the selector object and the model object for future scoring.
So I must persist both (e.g. with pickle), right?
Many thanks
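[A sketch of one way to handle this: chaining selector and model in a
Pipeline means a single pickled object captures both. The estimator choices
here are purely illustrative:

import pickle
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=30, random_state=0)
pipe = Pipeline([("select", SelectKBest(f_classif, k=10)),
                 ("clf", LogisticRegression())])
pipe.fit(X, y)

with open("model.pkl", "wb") as f:
    pickle.dump(pipe, f)    # selector and model persisted together
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)
print(restored.predict(X[:5]))
]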
Hi,
after fitting a clusterer I'd like to label new data. Is there an easier way
than building an ex-post classifier?
Many thanks
Christian
example in weka:
# Build the clusterer and save the object in cluster.cla
java -cp weka.jar weka.clusterers.EM -t data0.arff -d cluste
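[For inductive clusterers such as KMeans, no ex-post classifier is needed:
predict labels new data directly (purely transductive clusterers don't offer
this). A sketch with made-up data:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 2)
km = KMeans(n_clusters=3, random_state=0).fit(X)

new_points = np.array([[0.5, 0.5], [0.9, 0.1]])
print(km.predict(new_points))  # cluster label for each new point
]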
e, it is a bit heavy on the
math side). What do you think?
[0] http://jmlr.csail.mit.edu/papers/volume11/baehrens10a/baehrens10a.pdf
On 2 October 2012 14:34, Christian Jauvin wrote:
>> * "Advice for applying Machine Learning" [1] gives general recommendations
>> on ho
scikit-learn seems pretty much optimized. Or is it?
On 28.09.2012 14:29, Andreas Mueller wrote:
> Hi Christian.
> Are you thinking about 1d or 2d convolutions?
> I am not so familiar with 1d signal processing but there has
> been some work on convolutional sparse coding for image
t like "reverse engineering the features".
So my question: is there a mechanism or maybe an already existing
framework or theory for doing this? And would something approaching it
be possible currently with Sklearn?
Thanks,
Christian
building a dictionary of all shifted versions of all atoms and
then applying the implemented sparse coding algorithms. However, I don't
see a shift-invariant way for the dictionary learning part.
Thanks,
Christian
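[A minimal sketch of the sparse-coding half of that idea, i.e. stacking all
shifts of each atom and handing the result to SparseCoder; the atom, signal,
and sizes are made up, and the dictionary-learning half remains open:

import numpy as np
from sklearn.decomposition import SparseCoder

def shifted_dictionary(atoms, n):
    # One row per shift of each atom, zero-padded to signal length n
    rows = []
    for atom in atoms:
        for s in range(n - len(atom) + 1):
            row = np.zeros(n)
            row[s:s + len(atom)] = atom
            rows.append(row)
    D = np.array(rows)
    return D / np.linalg.norm(D, axis=1, keepdims=True)  # unit-norm rows

atoms = [np.array([1.0, -1.0, 1.0])]
D = shifted_dictionary(atoms, n=8)
coder = SparseCoder(dictionary=D, transform_algorithm="omp",
                    transform_n_nonzero_coefs=1)
signal = np.zeros((1, 8))
signal[0, 2:5] = [1.0, -1.0, 1.0]
print(coder.transform(signal))  # one coefficient, at the matching shift
]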
happening anymore.
But I'd be curious to know if there are any mechanisms I could use to
allow a Random Forest classifier to work with bigger datasets (than
what simply fits in memory)?
Thanks!
On 22 September 2012 16:18, Olivier Grisel wrote:
> 2012/9/22 Christian Jauvin :
>> Hi,
7/multiprocessing/pool.py", line 319, in _handle_tasks
    put(task)
SystemError: NULL result without error in PyObject_Call
I can provide additional details of course, but first maybe there is
something in particular I should be aware of, about size or memory
limit of the underlying objects in
i.e. the outcome of predict).
Is there a workaround for that, or is that a case where subclassing is
needed, as I had concluded before?
Christian
Hi Andreas,
You mean that I could use cross_val_score's score_func argument? I
tried it once, and it didn't work for some reason, and so I stuck
with the inheritance solution, which is really a 3-line modification
anyway. Is there another way?
Best,
Christian
On 21 September
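[In current scikit-learn, the equivalent of the old score_func argument is
the scoring parameter together with make_scorer, which avoids the
subclassing. A sketch with an illustrative metric and made-up data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score

X, y = make_classification(random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         scoring=make_scorer(f1_score), cv=5)
print(scores.mean())
]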
Hi Gilles,
> Are you sure the RF classifier is the same in both case? (have you set
> the random state to the same value?)
You're right, I forgot about that!
I just tested it, and both classifiers indeed produce identical
predictions with the same random_state value.
Thanks,
I have a classifier which derives from RandomForestClassifier, in
order to implement a custom "score" method. This obviously affects
scoring results obtained with cross-validation, but I observed that it
seems to also affect the actual predictions. In other words, the same
RF classifier with two di
(1) When I try to use it with a sparse matrix I get (for a binary problem):
--> 585     proba = np.ones((len(X), 2), dtype=np.float64)
--> 175     raise TypeError("sparse matrix length is ambiguous; use getnnz()"
    176                     " or shape[0]")
(2) When I try to use it fo
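[Regarding (1): len() is genuinely ambiguous for scipy.sparse matrices, as
the error says; the unambiguous row count is shape[0]. A small illustration:

import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.eye(3))
print(X.shape[0])  # 3 rows; works for sparse and dense alike
# len(X) would raise: "sparse matrix length is ambiguous; use getnnz() or shape[0]"
]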
Thanks, that's very helpful!
On 12 September 2012 11:47, Peter Prettenhofer
wrote:
> 2012/9/12 Peter Prettenhofer :
>> [..]
>>
>> AFAIK Fabian has some scikit-learn code for that as well.
>
> here is the code https://gist.github.com/2071994
> --
> Peter Prettenhofer
> May I ask why you think you need this?
It was my naive assumption of how to tackle class imbalance with an
SGD classifier, but as Olivier already suggested, using class_weight
makes more sense for this. Is there another mechanism or strategy that
I should be aware of you think?
repeat(p, len(y))
for i, v in enumerate(y):
    w[i] /= bc[v]
assert np.sum(w) == 1
return w
Christian
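[For completeness, a self-contained variant of what the truncated snippet
above appears to compute: per-sample weights giving every class the same
total weight, normalized to sum to 1. The reconstruction, in particular
p = 1/n_classes, is an assumption:

import numpy as np

def balanced_sample_weights(y):
    y = np.asarray(y)
    bc = np.bincount(y)                  # per-class counts
    n_classes = np.count_nonzero(bc)
    w = 1.0 / (n_classes * bc[y].astype(float))
    assert np.isclose(w.sum(), 1.0)
    return w

print(balanced_sample_weights([0, 0, 0, 1]))  # [1/6 1/6 1/6 1/2]
]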
# ~303MB
y = np.asarray(x)
print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024. # ~875MB
It doesn't make sense that np.asarray should almost triple the memory
consumption, does it? (With my real data, it's way worse, but I
cannot seem to replicate it with a simulat
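[One plausible explanation, assuming x is a nested Python list (as the ~303MB
figure suggests): np.asarray must build a brand-new contiguous array from a
list while the list itself stays alive, and ru_maxrss is a high-water mark
that never goes back down. asarray is only copy-free when the input is
already a matching ndarray:

import numpy as np

x = [[1.0] * 10 for _ in range(10)]  # plain nested lists
y = np.asarray(x)                    # copies: a new C-contiguous array is built
z = np.asarray(y)                    # no copy: input is already a matching ndarray
print(z is y)                        # True
]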