Re: [Scikit-learn-general] CountVectorizer token pattern

2014-09-19 Thread Andy
To detect this, you have to use word n-grams (or character n-grams over word boundaries, which would not result in your problem). If "A" is a stop word, that would also not be caught, right? So how would using a stop-word list instead of a minimum length fix your issue? Because you would have rather looked
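
A minimal sketch of the word n-gram idea, using only Python's `re` module rather than scikit-learn itself. The helper name `word_bigrams` and the sample sentence are hypothetical, and the single-character pattern `\b\w+\b` is an assumption made so that "a" survives tokenization. It also shows why a stop-word filter applied before n-gram construction would lose "hepatitis a" just as a minimum token length does.

```python
import re

# Assumed pattern that also keeps single-character tokens.
SINGLE_CHAR_PATTERN = re.compile(r"(?u)\b\w+\b")


def word_bigrams(text, stop_words=frozenset()):
    """Return word bigrams of `text`, dropping stop words before pairing."""
    tokens = SINGLE_CHAR_PATTERN.findall(text.lower())
    tokens = [t for t in tokens if t not in stop_words]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Hypothetical sample text.
sample = "Hepatitis A vaccine"

bigrams_plain = word_bigrams(sample)
bigrams_stopped = word_bigrams(sample, stop_words={"a"})

print(bigrams_plain)
print(bigrams_stopped)
```

With no stop words the bigram "hepatitis a" is kept; once "a" is treated as a stop word it disappears from every bigram.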

[Scikit-learn-general] CountVectorizer token pattern

2014-09-17 Thread Nathan Breit
I was wondering what the rationale is for making the default token pattern for the CountVectorizer require *2* or more alphanumeric characters to form a token. This was not intuitive default behavior for me, so I ended up with a bug where some strings in my vocabulary like "Hepatitis A" were not co
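
For illustration, the behavior described here can be reproduced with the regular expression alone. This sketch assumes the default `token_pattern` is `(?u)\b\w\w+\b` and that input is lowercased first, as CountVectorizer does by default. The alternative pattern and the variable names are hypothetical choices, not something proposed in the thread.

```python
import re

# Assumed default token pattern: two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical alternative: one or more word characters.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

text = "Hepatitis A".lower()

default_tokens = re.findall(DEFAULT_PATTERN, text)
single_char_tokens = re.findall(SINGLE_CHAR_PATTERN, text)

print(default_tokens)
print(single_char_tokens)
```

Under these assumptions, the same alternative pattern could be passed to the vectorizer as `CountVectorizer(token_pattern=r"(?u)\b\w+\b")` to keep single-character tokens.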