I'm trying to use regular expressions to extract strings that match certain patterns in a collection of texts. Basically, these texts are edited versions of medieval manuscripts that use certain symbols to mark information that is useful for philologists.
I'm interested in isolating words that have some non-alphanumeric symbol attached to the beginning or the end of the word, or inserted in the middle. Here are some examples: '¿de', '«orden', '§Don', '·II·', 'que·l', 'Rey»'.

I'm using some modules from a package called NLTK, but I think my problem comes from a misunderstanding of how regular expressions work. Here's what I do. This was just a first attempt to get strings starting with a non-alphanumeric symbol. If this had worked, I would have continued to build the regular expression to get words with non-alphanumeric symbols in the middle and at the end. Alas, even this first attempt didn't work.

---------
from nltk.tokenize import RegexpTokenizer

with open('output_tokens.txt', 'a') as out_tokens:
    with open('text.txt', 'r') as in_tokens:
        # One or more symbols, then word characters, then one non-space character
        t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
        output = t.tokenize(in_tokens.read())
        for item in output:
            out_tokens.write(" %s" % (item))
--------

What puzzles me is that I get some results that don't make much sense given the regular expression. Here's an excerpt from the text I'm processing:

---------------
"<filename=B-05-Libro_Oersino__14-214-2.txt> %Pág. 87 &L-[LIBRO VII. DE OÉRSINO]&L+ &// §Comeza el ·VII· libro, que es de Oérsino las bístias. &// §Canto Félix ha tomado prenda del phisoloffo, el […] ·II· hómnes, e ellos"
----------------

Here's the relevant part of the output file ('output_tokens.txt'):

----------
" <filename= -05- _Oersino__14- -2. %Pág. &L- [LLIBRO ÉRSINO] &L+ §Comenza ·VII· ístias. §Canto élix ·II· ómnes"
-----------

If you look closely, some words that contain an accented character get treated in a strange way: the characters before the accented one get dropped, and the accented character behaves as if it were a non-alphanumeric symbol.

What is going on? What am I doing wrong?

Josep M.
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
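
P.S. The behaviour can be reproduced without NLTK. What follows is only a minimal sketch, assuming Python 3 and the standard `re` module; the sample string and the variable names are made up for illustration. It suggests that the class `[^a-zA-Z\s0-9]` accepts accented letters simply because they lie outside the ASCII ranges `a-z` and `A-Z`. The `[^\w\s]` variant is an untested guess at a fix for the real texts, not a confirmed solution.

```python
import re

# Same pattern as in the tokenizer: the leading class means "not an ASCII
# letter, not an ASCII digit, not whitespace". An accented letter such as
# 'é' is none of those, so the class accepts it as if it were a symbol.
ascii_class = re.compile(r'[^a-zA-Z\s0-9]+\w+\S')

# Hypothetical alternative: in Python 3, \w is Unicode-aware for str
# patterns, so [^\w\s] rejects accented letters as well as ASCII ones.
unicode_class = re.compile(r'[^\w\s]+\w+\S')

# Made-up sample loosely based on the excerpt above.
sample = "§Canto Félix ha tomado ·II· hómnes"

ascii_matches = ascii_class.findall(sample)
unicode_matches = unicode_class.findall(sample)

print(ascii_matches)    # includes fragments that start at the accented letter
print(unicode_matches)  # only tokens that start with a real symbol
```

With the first pattern a match can begin at the accented letter itself, which would explain why everything before it in the word is missing from the output.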