<snip intro>
> Here's what I do. This was just a first attempt to get strings
> starting with a non alpha-numeric symbol. If this had worked, I would
> have continued to build the regular expression to get words with non
> alpha-numeric symbols in the middle and in the end. Alas, even this
> first attempt didn't work.
>
> ---------
> with open('output_tokens.txt', 'a') as out_tokens:
>     with open('text.txt', 'r') as in_tokens:
>         t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
>         output = t.tokenize(in_tokens.read())
>         for item in output:
>             out_tokens.write(" %s" % (item))
>
> --------
>
> What puzzles me is that I get some results that don't make much sense
> given the regular expression. Here's some excerpt from the text I'm
> processing:
>
> ---------------
> "<filename=B-05-Libro_Oersino__14-214-2.txt>
>
> %Pág. 87
> &L-[LIBRO VII. DE OÉRSINO]&L+ &//
> §Comeza el ·VII· libro, que es de Oérsino las bístias. &//
> §Canto Félix ha tomado prenda del phisoloffo, el […] ·II· hómnes, e ellos"
> ----------------
>
>
> Here's the relevant part of the output file ('output_tokens.txt'):
>
> ----------
> " <filename= -05- _Oersino__14- -2. %Pág. &L- [LLIBRO ÉRSINO] &L+
> §Comenza ·VII· ístias. §Canto élix ·II· ómnes"
> -----------
>
> If you notice, there are some words that have an accented character
> that get treated in a strange way: all the characters that don't have
> a tilde get deleted and the accented character behaves as if it were a
> non alpha-numeric symbol.
>
> What is going on? What am I doing wrong?

I don't know for sure, but I would hazard a guess that you didn't specify unicode for the regular expression: without a flag, character classes like \w and \s only match ASCII characters, and with re.LOCALE they depend on your locale settings. A flag like re.UNICODE could help, but I don't know if RegexpTokenizer accepts that. Note also that your explicit range [^a-zA-Z\s0-9] counts an accented letter such as é as a non alpha-numeric symbol whatever flag you set, so you would probably want something like [^\w\s] there.

It would also appear that you could get a long way with the builtin re module (re.findall, or re.split if you would rather split on separators), and supply the flag inside that function; no need then for RegexpTokenizer. Your tokenizer just appears to return the pieces of text that match the pattern you specify.

Lastly, an output convenience:

out_tokens.write(' '.join(list(output)))

instead of the for-loop. (I'm casting output to a list here, since I don't know whether output is a list or an iterator.)

Let us know if UNICODE (or other LOCALE settings) can solve your problem.

Cheers,

Evert
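
P.S. To make the above a bit more concrete, here is a minimal sketch using plain re.findall rather than RegexpTokenizer. It assumes Python 3, where str patterns are Unicode-aware by default, so re.ASCII is used to imitate the ASCII-only behaviour. The sample string is cut down from your excerpt, and the [^\w\s] variant of your pattern is my own assumption about what you are after; I have not tried it on your full text.

```python
import re

# Shortened sample based on the excerpt quoted above.
sample = "§Canto Félix ·II· hómnes"

# The original pattern, with its explicit ASCII ranges.
ORIGINAL = r'[^a-zA-Z\s0-9]+\w+\S'
# Assumed variant: "neither a word character nor whitespace",
# so that accented letters are no longer treated as symbols.
UNICODE_AWARE = r'[^\w\s]+\w+\S'

# re.ASCII restricts \w and \s to ASCII, which roughly mimics a
# pattern compiled without re.UNICODE or re.LOCALE in Python 2.
ascii_tokens = re.findall(ORIGINAL, sample, re.ASCII)

# re.UNICODE is already the default for str patterns in Python 3;
# it is spelled out here only to make the intent explicit.
unicode_tokens = re.findall(UNICODE_AWARE, sample, re.UNICODE)

print(' '.join(ascii_tokens))
print(' '.join(unicode_tokens))
```

The first call reproduces the odd "élix" and "ómnes" fragments from your output file, while the second leaves "Félix" and "hómnes" alone and only picks up the tokens that really start with a symbol.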