Sorry, something went wrong and my message got sent before I could finish it. I'll try again.
On Tue, Nov 30, 2010 at 2:19 PM, Josep M. Fontana <josep.m.font...@gmail.com> wrote:
> On Sun, Nov 28, 2010 at 6:03 PM, Evert Rol <evert....@gmail.com> wrote:
> <snip intro>

<snip>

>> ---------
>> with open('output_tokens.txt', 'a') as out_tokens:
>>     with open('text.txt', 'r') as in_tokens:
>>         t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
>>         output = t.tokenize(in_tokens.read())
>>         for item in output:
>>             out_tokens.write(" %s" % (item))
>
> I don't know for sure, but I would hazard a guess that you didn't specify
> unicode for the regular expression: character classes like \w and \s are
> dependent on your LOCALE settings.
> A flag like re.UNICODE could help, but I don't know if RegexpTokenizer
> accepts that.

OK, this must be the problem. The text is in ISO-8859-1, not in Unicode. I tried
to fix the problem by doing the following:

-------------
import codecs

[...]

with codecs.open('output_tokens.txt', 'a', encoding='iso-8859-1') as out_tokens:
    with codecs.open('text.txt', 'r', encoding='iso-8859-1') as in_tokens:
        t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
        output = t.tokenize(in_tokens.read())
        for item in output:
            out_tokens.write(" %s" % (item))
-------------------

Specifying that the encoding is 'iso-8859-1' didn't do anything, though. The
output I get is still the same.

>> It would also appear that you could get a long way with the builtin re.split
>> function, and supply the flag inside that function; no need then for
>> RegexpTokenizer. Your tokenizer just appears to split on the tokens you
>> specify.

Yes, this is in fact what RegexpTokenizer seems to do. Here's what the short
description of the class says:

"""
A tokenizer that splits a string into substrings using a regular expression.
The regular expression can be specified to match either tokens or separators
between tokens. Unlike C{re.findall()} and C{re.split()}, C{RegexpTokenizer}
does not treat regular expressions that contain grouping parenthases specially.
"""

source:
http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/tokenize/regexp.py?r=8539

Since I'm using the NLTK package and this module seemed to do what I needed, I
thought I might as well use it. I thought (and I still do) that the problem I
was having didn't have to do with the correct use of this module but with the
way I constructed the regular expression. I wouldn't have asked the question
here if I thought that the problem had to do with this module.

If I understand correctly how re.split works, though, I don't think I would
obtain the results I want. re.split would give me a list of the strings that
occur around the pattern I specify as its first argument, right? But what I
want is to match all the words that contain some non-alphanumeric character in
them and to exclude the rest of the words. Since these words are surrounded by
spaces or line returns, or a combination thereof, just like the other "normal"
words, I can't think of any pattern I could use in re.split() that would
discriminate between the two kinds of strings. So I don't know how I would do
what I want with re.split.

Josep M.
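P.S. To make the contrast with re.split a bit more concrete, here is the kind
of thing I have in mind. This is only a rough, untested sketch; the file names
and the character class are just placeholders carried over from my earlier
attempts:

-------------
import codecs
import re

# Read the whole file as Unicode so the accented letters come in as single
# characters instead of raw ISO-8859-1 bytes.
with codecs.open('text.txt', 'r', encoding='iso-8859-1') as in_tokens:
    text = in_tokens.read()

# Split on whitespace and keep only the tokens that contain at least one
# character outside plain a-z, A-Z, 0-9.  Because the test is a negative
# character class rather than \w, the LOCALE/re.UNICODE issue doesn't come up
# here; if the pattern relied on \w instead, re.UNICODE would be needed for
# accented letters to count as word characters.
wanted = [tok for tok in text.split()
          if re.search(r'[^a-zA-Z0-9]', tok)]

with codecs.open('output_tokens.txt', 'a', encoding='iso-8859-1') as out_tokens:
    out_tokens.write(" ".join(wanted))
-------------

Of course, a trailing comma or period would also count as a "non-alphanumeric
character" here, so the character class would still need some tuning for real
text.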