To detect this, you have to use word n-grams (or character n-grams that span
word boundaries, which would not run into your problem).
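Something like this, for example (a rough sketch; the relaxed `token_pattern` is my addition, since the default pattern would still drop the single-letter token before the n-grams are built):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Patient tested positive for Hepatitis A."]

# Word unigrams + bigrams; keeping single-character tokens lets "a"
# survive tokenization, so "hepatitis a" can be formed as a bigram.
vec = CountVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b")
vec.fit(docs)
print([f for f in vec.get_feature_names_out() if "hepatitis" in f])
# ['hepatitis', 'hepatitis a']
```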
If A is a stop-word, that would also not be caught, right?
So how would using a stop-word list instead of a minimum token length fix your issue?
Because you would rather have looked
I was wondering what the rationale is for making the default token pattern
for the CountVectorizer require *2* or more alphanumeric characters to form
a token. This was not intuitive default behavior for me, so I ended up with
a bug where some strings in my vocabulary, like "Hepatitis A", were not
counted.
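To illustrate what I ran into (a minimal sketch with a made-up document; `token_pattern=r"(?u)\b\w+\b"` is the workaround I ended up using, not the default):

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["Hepatitis A outbreak reported"]

# The default token_pattern r"(?u)\b\w\w+\b" drops the single-char "A",
# so the vocabulary entry "hepatitis a" can never be matched.
default_vec = CountVectorizer(ngram_range=(1, 2), vocabulary=["hepatitis a"])
print(default_vec.transform(doc).toarray())  # [[0]]

# Keeping one-character tokens makes the entry match as expected.
fixed_vec = CountVectorizer(ngram_range=(1, 2),
                            token_pattern=r"(?u)\b\w+\b",
                            vocabulary=["hepatitis a"])
print(fixed_vec.transform(doc).toarray())  # [[1]]
```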