Re: [Scikit-learn-general] Topic extraction

2015-04-29 Thread C K Kashyap
Thanks Vlad and Lee, I just found out that the following loop is not really listing the right topics -

    k = 0
    for i in np.argmax(nmf.transform(tfidf), axis=1):
        print("Topic = ", feature_names[i], " ", topic_weights[i])
        print("Document = ", data[k])
        k = k + 1

The complete code is here http://lpas
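One plausible reason the loop misbehaves (an assumption, since the complete code is only in the linked paste): `i` is a topic index, but `feature_names[i]` looks up the i-th vocabulary word rather than the words that characterise topic `i`. A minimal sketch of one way around that, using a made-up four-document corpus and reading each topic's heaviest words from `nmf.components_`:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, for illustration only.
data = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "stock markets fell as investors sold shares",
    "investors bought shares when markets rose",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(data)

# Vocabulary terms ordered by column index (works across sklearn versions).
vocab = vectorizer.vocabulary_
feature_names = np.array(sorted(vocab, key=vocab.get))

nmf = NMF(n_components=2, random_state=0, max_iter=500)
doc_topic = nmf.fit_transform(tfidf)       # shape (n_docs, n_topics)
best_topic = np.argmax(doc_topic, axis=1)  # one topic index per document

for doc, topic in zip(data, best_topic):
    # Indices of the three heaviest words in this topic's row of components_.
    top_word_ids = np.argsort(nmf.components_[topic])[::-1][:3]
    print("Topic", topic, "words:", ", ".join(feature_names[top_word_ids]))
    print("Document =", doc)
```

Iterating with `zip(data, best_topic)` also removes the need for the manual `k` counter.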

Re: [Scikit-learn-general] Topic extraction

2015-04-29 Thread Lee Zamparo
As Vlad suggests, the number of topics is a hyper-parameter, and you can optimize its value using cross-validation, though I think there are other hyper-parameter estimation methods in sklearn as well. There are also many other closely related projects which could wrap your NMF and report back the id
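A rough sketch of what cross-validating the number of topics could look like, assuming held-out reconstruction error as the score (the thread does not say which score to use). The random matrix stands in for a tf-idf matrix, and `heldout_error` is a hypothetical helper, not part of scikit-learn:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.random((30, 12))  # stand-in for a non-negative tf-idf matrix


def heldout_error(X, n_topics, n_splits=3):
    """Mean Frobenius reconstruction error on held-out rows (hypothetical helper)."""
    errors = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X):
        nmf = NMF(n_components=n_topics, random_state=0, max_iter=500)
        nmf.fit(X[train_idx])
        # Project the held-out documents onto the topics learned from the rest.
        W_test = nmf.transform(X[test_idx])
        residual = X[test_idx] - W_test @ nmf.components_
        errors.append(np.linalg.norm(residual))
    return float(np.mean(errors))


scores = {k: heldout_error(X, k) for k in (2, 4, 6)}
for k, err in scores.items():
    print(k, round(err, 3))
```

Because the held-out rows still get their own topic weights fitted, this error tends to keep falling as topics are added, so it is probably better read as an elbow plot than minimised outright.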

Re: [Scikit-learn-general] Topic extraction

2015-04-29 Thread Vlad Niculae
Another thing I've seen people do is to threshold based on the difference between the scores of the best and second best topics. (Only take documents with a clear winning topic.) For estimating the number of topics, you can use cross-validation. Vlad On Wed, Apr 29, 2015 at 12:42 AM, Joel Nothman
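The margin idea can be sketched with plain NumPy. The score matrix and the `min_margin` value below are made up for illustration; in practice the matrix would come from `nmf.transform(tfidf)`:

```python
import numpy as np

# Made-up document-topic scores, shape (n_docs, n_topics).
W = np.array([
    [0.90, 0.10, 0.00],
    [0.20, 0.30, 0.25],
    [0.00, 0.05, 1.40],
    [0.60, 0.55, 0.10],
])

min_margin = 0.2  # hypothetical cut-off; tune for your data

# Sort each row ascending and keep the two largest scores.
top_two = np.sort(W, axis=1)[:, -2:]
margin = top_two[:, 1] - top_two[:, 0]  # best minus second best

clear_winner = margin > min_margin
best_topic = np.argmax(W, axis=1)

for doc_id in np.flatnonzero(clear_winner):
    print("doc", doc_id, "-> topic", best_topic[doc_id])
```

Documents 1 and 3 are dropped here because their two best topics score almost the same.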

Re: [Scikit-learn-general] Topic extraction

2015-04-29 Thread C K Kashyap
Thanks Joel, What about estimating the number of topics? Is there a recommended way to do it? Regards, Kashyap On Wed, Apr 29, 2015 at 12:25 PM, Joel Nothman wrote: > Yes, this is not a probabilistic method. > > On 29 April 2015 at 14:56, C K Kashyap wrote: > >> Works like a charm. Just notic

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread Joel Nothman
Yes, this is not a probabilistic method. On 29 April 2015 at 14:56, C K Kashyap wrote: > Works like a charm. Just noticed though that the max value is sometimes > more than 1.0. Is that okay? > > Regards, > Kashyap > > On Wed, Apr 29, 2015 at 10:12 AM, Joel Nothman > wrote: > >> mask with n
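Since the scores are unbounded weights rather than probabilities, one option (an assumption on my part, not something suggested in the thread) is to rescale each document's row to sum to 1 when relative proportions are easier to threshold. The matrix below is made up, and the result is still not a probability in any model-based sense:

```python
import numpy as np

# Made-up document-topic scores; note the 1.4, which exceeds 1.0.
W = np.array([
    [0.90, 0.10, 0.00],
    [0.00, 0.05, 1.40],
    [0.00, 0.00, 0.00],  # a document that matches no topic at all
])

row_sums = W.sum(axis=1, keepdims=True)

# Divide each row by its sum; rows that sum to zero are left as zeros.
proportions = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)

print(proportions)
```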

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread C K Kashyap
Works like a charm. Just noticed, though, that the max value is sometimes more than 1.0. Is that okay? Regards, Kashyap On Wed, Apr 29, 2015 at 10:12 AM, Joel Nothman wrote: > mask with np.max(..., axis=1) > threshold > > On 29 April 2015 at 14:35, C K Kashyap wrote: > >> Thank you so much J

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread Joel Nothman
mask with np.max(..., axis=1) > threshold On 29 April 2015 at 14:35, C K Kashyap wrote: > Thank you so much Joel, > > I understood. Just one more thing please. > > How can I include a document against its highest ranking topic only if it > crosses a threshold? > > regards, > Kashyap > > On Wed,
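Spelled out, the mask suggestion might look like the sketch below. The score matrix and the `threshold` value are made up; the matrix stands in for the output of `nmf.transform(tfidf)`:

```python
import numpy as np

# Made-up document-topic scores, shape (n_docs, n_topics).
W = np.array([
    [0.90, 0.10, 0.00],
    [0.20, 0.30, 0.25],
    [0.00, 0.05, 1.40],
    [0.60, 0.55, 0.10],
])

threshold = 0.5  # hypothetical value

best_topic = np.argmax(W, axis=1)
mask = np.max(W, axis=1) > threshold  # True where the best score is high enough

# Report only the documents whose best topic clears the threshold.
for doc_id in np.flatnonzero(mask):
    print("doc", doc_id, "-> topic", best_topic[doc_id])
```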

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread C K Kashyap
Thank you so much Joel, I understood. Just one more thing please. How can I include a document against its highest ranking topic only if it crosses a threshold? regards, Kashyap On Wed, Apr 29, 2015 at 9:45 AM, Joel Nothman wrote: > Highest ranking topic for each doc is just np.argmax(nmf.tr

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread Joel Nothman
Highest ranking topic for each doc is just np.argmax(nmf.transform(tfidf), axis=1). This is because nmf.transform(tfidf) returns a matrix of shape (num samples, num components / to
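A small illustration of why `axis=1` gives one topic per document. The matrix here is made up, with one row per document and one column per topic, as a stand-in for what `nmf.transform(tfidf)` returns:

```python
import numpy as np

# Made-up scores: 4 documents (rows) by 3 topics (columns).
W = np.array([
    [0.90, 0.10, 0.00],
    [0.20, 0.30, 0.25],
    [0.00, 0.05, 1.40],
    [0.60, 0.55, 0.10],
])

# axis=1 searches across the columns of each row,
# so the result has one entry per document.
best_topic = np.argmax(W, axis=1)
best_score = np.max(W, axis=1)

print(W.shape)      # (4, 3)
print(best_topic)   # [0 1 2 0]
```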

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread C K Kashyap
Thanks Joel and Andreas, Joel, I think "highest ranking topic for each doc" is exactly what I am looking for. Could you elaborate on the code please? What would be dataset.target_names and dataset.target in my case - http://lpaste.net/131649 Regards, Kashyap On Wed, Apr 29, 2015 at 3:08 AM, Joe

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread Joel Nothman
This shows the newsgroup name and highest scoring topic for each doc. zip(np.take(dataset.target_names, dataset.target), np.argmax(nmf.transform(tfidf), axis=1)) I think something based on this should be added to the example. On 29 April 2015 at 07:01, Andreas Mueller wrote: > Clusters are on
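To see what the `zip` expression produces, here is a sketch with made-up stand-ins for `dataset.target_names`, `dataset.target`, and the transformed matrix. On Python 3, `zip` returns an iterator, so it is wrapped in `list` before printing:

```python
import numpy as np

# Made-up stand-ins for the 20 newsgroups attributes.
target_names = ["rec.autos", "sci.space"]
target = np.array([1, 0, 1])  # label index of each document

# Made-up stand-in for nmf.transform(tfidf): 3 documents, 2 topics.
W = np.array([
    [0.1, 0.8],
    [0.7, 0.2],
    [0.3, 0.9],
])

# np.take maps each label index to its newsgroup name.
names = np.take(target_names, target)
pairs = list(zip(names, np.argmax(W, axis=1)))

for name, topic in pairs:
    print(name, "-> topic", topic)
```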

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread Andreas Mueller
Clusters are one per data point, while topics are not. So the model is slightly different. You can get the list of topics for each sample using NMF().fit_transform(X). On 04/28/2015 01:13 PM, C K Kashyap wrote: Hi everyone, I am new to scikit. I only feel sad for not knowing it earlier - it's
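A short sketch of the difference, using random non-negative data as a stand-in for a real document-term matrix: `fit_transform` returns a weight for every topic in every sample, rather than a single cluster label per sample.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((6, 8))  # stand-in for a non-negative document-term matrix

nmf = NMF(n_components=3, random_state=0, max_iter=500)
W = nmf.fit_transform(X)  # shape (n_samples, n_topics)

# Each row holds one non-negative weight per topic, so a sample
# can load on several topics at once.
topics_per_sample = (W > 0).sum(axis=1)

print(W.shape)
print(topics_per_sample)
```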