Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-05 Thread Andreas Mueller
On 06/04/2013 08:27 PM, Tom Fawcett wrote: On Jun 4, 2013, at 2:38 AM, Lars Buitinck l.j.buiti...@uva.nl wrote: 2013/6/4 Joel Nothman jnoth...@student.usyd.edu.au: NLP folks pass the blame to IR folks :P ... and IR folks always mean absolute frequency, unless stated otherwise. Coming from

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-05 Thread Joel Nothman
Or perhaps the docs should consider including a glossary that translates some of these meanings and specifies what is preferred for sklearn development/documentation. On Thu, Jun 6, 2013 at 2:17 AM, Andreas Mueller amuel...@ais.uni-bonn.de wrote: On 06/04/2013 08:27 PM, Tom Fawcett wrote: On

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-04 Thread Lars Buitinck
2013/6/4 Joel Nothman jnoth...@student.usyd.edu.au: NLP folks pass the blame to IR folks :P ... and IR folks always mean absolute frequency, unless stated otherwise. -- Lars Buitinck Scientific programmer, ILPS University of Amsterdam

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-04 Thread Tom Fawcett
On Jun 4, 2013, at 2:38 AM, Lars Buitinck l.j.buiti...@uva.nl wrote: 2013/6/4 Joel Nothman jnoth...@student.usyd.edu.au: NLP folks pass the blame to IR folks :P ... and IR folks always mean absolute frequency, unless stated otherwise. Coming from ML, I’ve seen it used as both absolute and

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-03 Thread Andreas Mueller
On 06/02/2013 08:48 PM, Harold Nguyen wrote: Hi Lars, Thank you very much for this response. Please excuse my questions since I'm new. From the documentation on TfidfVectorizer here: http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html Does

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-03 Thread Lars Buitinck
2013/6/2 Harold Nguyen har...@nexgate.com: http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html Does TfidfVectorizer take a sequence of filenames, where each file is just a plain text file? That depends on the parameter input (the first in the list).
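
To illustrate the point about the input parameter, here is a minimal sketch of TfidfVectorizer fed once with raw strings and once with file paths. The sample texts and temporary file names are made up for illustration; only the input="content" and input="filename" options come from scikit-learn itself.

```python
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents, used only for this illustration.
texts = ["the cat sat on the mat", "the dog chased the cat"]

# input="content" (the default): each item is the document text itself.
X_content = TfidfVectorizer(input="content").fit_transform(texts)

# input="filename": each item is a path, and the vectorizer reads the file.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, text in enumerate(texts):
        path = os.path.join(tmp, f"doc{i}.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)
        paths.append(path)
    X_files = TfidfVectorizer(input="filename").fit_transform(paths)

# Both routes should produce a matrix of the same shape.
print(X_content.shape, X_files.shape)
```

So a sequence of plain-text filenames works as long as input="filename" is set; with the default, the path strings themselves would be treated as the documents.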

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-03 Thread Lars Buitinck
2013/6/3 Andreas Mueller amuel...@ais.uni-bonn.de: I named the variable, I think, and it is a bad name :-( Should we rename it? I think giving a count makes more sense than giving a frequency: you want to exclude outliers that appear only once or twice for example. I actually hadn't seen
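
The snippet above does not name the variable being discussed. Assuming it is the min_df parameter of the text vectorizers (an assumption, not something stated in this excerpt), the count-versus-frequency distinction can be sketched as follows. In scikit-learn an integer min_df is read as an absolute document count and a float as a proportion of documents; the toy documents are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "apple" is in 4 documents, "banana" in 2,
# "cherry" and "date" in 1 each.
docs = ["apple banana", "apple cherry", "apple banana date", "apple"]

# Integer min_df: keep terms that occur in at least 2 documents (a count).
by_count = CountVectorizer(min_df=2).fit(docs)

# Float min_df: keep terms that occur in at least 50% of documents (a proportion).
by_fraction = CountVectorizer(min_df=0.5).fit(docs)

print(sorted(by_count.vocabulary_))
print(sorted(by_fraction.vocabulary_))
```

With four documents the two settings happen to select the same terms; on a larger corpus an integer threshold stays fixed while a float threshold scales with the number of documents.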

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-02 Thread Lars Buitinck
2013/6/1 Harold Nguyen har...@nexgate.com: I was wondering if anyone can point me to a tutorial on clustering text documents, but then also displaying the results in a graph? I see some examples on clustering text documents, but I'd like to be able to visualize the clusters. You'll need
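
The reply is cut off in this listing, so the following is not necessarily what was recommended. As one possible sketch of the question's goal, documents can be vectorized with TF-IDF, clustered with KMeans, and projected to two dimensions with TruncatedSVD so that each document gets plottable coordinates. The corpus, the number of clusters, and the choice of TruncatedSVD are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two rough topics.
docs = [
    "the cat chased the mouse",
    "a cat and a kitten played",
    "the kitten slept near the mouse",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
    "bond markets rallied after the report",
]

# Turn the documents into a sparse TF-IDF matrix.
X = TfidfVectorizer().fit_transform(docs)

# Assign each document to one of two clusters (the cluster count is an assumption).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Project the high-dimensional vectors to 2-D so they can be drawn.
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Each row is now an (x, y) point with a cluster label.
for (x, y), label in zip(coords, labels):
    print(f"cluster={label} x={x:.3f} y={y:.3f}")
```

The coords and labels arrays can then be handed to any plotting library, for example as a scatter plot colored by label.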

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-02 Thread Harold Nguyen
Hi Lars, Thank you very much for this response. Please excuse my questions since I'm new. From the documentation on TfidfVectorizer here: http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html Does TfidfVectorizer take a sequence of filenames, where