Re: [Tutor] A regular expression problem
Josep M. Fontana wrote:

> [...] I guess this is because the character encoding was not specified,
> but accented characters in the languages I'm dealing with should be
> treated as a-z or A-Z, shouldn't they?

No. a-z means a-z. If you want the localized set of alphanumeric characters, you need \w. Likewise 0-9 means 0-9. If you want localized digits, you need \d.

> I mean, how do you deal with languages that are not English with
> regular expressions? I would assume that as long as you set the right
> encoding, Python will be able to determine which subset of specific
> sequences of bytes count as a-z or A-Z.

Encodings have nothing to do with this issue. Literal characters a, b, ..., z etc. always have ONE meaning: they represent themselves (although possibly in a case-insensitive fashion). E means E, not È, É, Ê or Ë.

Localization tells the regex how to interpret special patterns like \d and \w. This has nothing to do with encodings -- by the time the regex sees the string, it is already dealing with characters. Localization is about which characters fall into which categories ("is 5 a digit or a letter? how about ٣?").

Encoding is used to translate between bytes on disk and characters. For example, the character Ë could be stored on disk as the hex bytes:

\xcb              # one byte
\xc3\x8b          # two bytes
\xff\xfe\xcb\x00  # four bytes

and more, depending on the encoding used.

--
Steven

___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
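A minimal sketch of the two distinctions above (Python 3, where str is already Unicode and \w is Unicode-aware by default; in the Python 2 of this thread you would need the re.UNICODE flag):

```python
import re

# [a-zA-Z] matches only the ASCII letters, so the accented letter breaks the word.
word = 'Félix'
ascii_runs = re.findall(r'[a-zA-Z]+', word)   # ['F', 'lix']

# \w is category-based and includes accented letters.
word_runs = re.findall(r'\w+', word)          # ['Félix']

# Encoding is a separate, earlier step: the same character Ë maps to
# different byte sequences depending on the encoding chosen.
one_byte = 'Ë'.encode('latin-1')    # b'\xcb'
two_bytes = 'Ë'.encode('utf-8')     # b'\xc3\x8b'
four_bytes = 'Ë'.encode('utf-16')   # b'\xff\xfe\xcb\x00' (BOM + code unit)
```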
Re: [Tutor] A regular expression problem
On Sun, Nov 28, 2010 at 6:14 PM, Steven D'Aprano wrote:

> Have you considered just using the isalnum() method?
>
> >>> '¿de'.isalnum()
> False

Mmm. No, I didn't consider it because I didn't even know such a method existed. It could turn out to be very handy, but I don't think it would help me at this stage: the texts I'm working with also contain a lot of non-alphanumeric characters that occur in isolation, so I would get a lot of noise.

> The first thing to do is to isolate the cause of the problem. In your code
> below, you do four different things. In no particular order:
>
> 1 open and read an input file;
> 2 open and write an output file;
> 3 create a mysterious "RegexpTokenizer" object, whatever that is;
> 4 tokenize the input.
>
> We can't run your code because:
>
> 1 we don't have access to your input file;
> 2 most of us don't have the NLTK package;
> 3 we don't know what RegexpTokenizer does;
> 4 we don't know what tokenizing does.

As I said in my answer to Evert, I assumed the problem I was having had to do exclusively with the regular expression pattern I was using. The code for RegexpTokenizer seems to be pretty simple (http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/tokenize/regexp.py?r=8539) and all it does is:

"""
Tokenizers that divide strings into substrings using regular expressions
that can match either tokens or separators between tokens.
"""

> you should write:
>
> r'[^a-zA-Z\s0-9]+\w+\S'

Now you can understand why I didn't use r' '. The methods in the module already use it internally, and I just need to insert the regular expression as the argument.

> Your regex says to match:
>
> - one or more characters that aren't letters a...z (in either
> case), space or any digit (note that this is *not* the same as
> characters that aren't alphanumeric);
>
> - followed by one or more alphanumeric characters;
>
> - followed by exactly one character that is not whitespace.
> I'm guessing the "not whitespace" is troublesome -- it will match characters
> like ¿ because it isn't whitespace.

This was my first attempt to match strings like '&patre--' or '&patre'. The "not whitespace" was intended to match the occurrence of non-alphanumeric characters appearing after "regular" characters. I realize I should have added '*' after '\S', since I also want to match words that do not have a non-alphanumeric symbol at the end (i.e. '&patre' as opposed to '&patre--').

> I'd try this:
>
> # untested
> \b.*?\W.*?\b
>
> which should match any word with a non-alphanumeric character in it:
>
> - \b ... \b matches the start and end of the word;
>
> - .*? matches zero or more characters (as few as possible);
>
> - \W matches a single non-alphanumeric character.
>
> So putting it all together, that should match a word with at least one
> non-alphanumeric character in it.

But since '.' matches any character except a newline, this would also yield strings where all the characters are non-alphanumeric. I should have said this in my initial message, but the texts I'm working with contain lots of strings consisting entirely of non-alphanumeric characters (i.e. '&%+' or '&//'). I'm trying to match only strings that are a mixture of both non-alphanumeric characters and [a-zA-Z].

> [...]
>
> > If you notice, there are some words that have an accented character
> > that get treated in a strange way: all the characters that don't have
> > a tilde get deleted and the accented character behaves as if it were a
> > non-alphanumeric symbol.
>
> Your regex matches if the first character isn't a space, a digit, or
> a-zA-Z. Accented characters aren't a-z or A-Z, and therefore will match.

I guess this is because the character encoding was not specified, but accented characters in the languages I'm dealing with should be treated as a-z or A-Z, shouldn't they? I mean, how do you deal with languages that are not English with regular expressions?
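One way to express "a mixture of both" is to require a word character and a symbol character in the same token. A small sketch (untested against the real corpus, and assuming the text has already been decoded to Unicode):

```python
import re

tokens = ['§Comeza', 'el', '·VII·', 'libro', '&%+', '&//', 'que·l', 'Rey»', '&patre--']

# Keep a token only if it contains at least one word character (letter or digit)
# AND at least one character that is neither a word character nor whitespace.
mixed = [t for t in tokens
         if re.search(r'\w', t) and re.search(r'[^\w\s]', t)]

print(mixed)   # ['§Comeza', '·VII·', 'que·l', 'Rey»', '&patre--']
```

Pure-symbol strings like '&%+' and '&//' fail the first test, and plain words like 'libro' fail the second, so only the mixed strings survive.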
I would assume that as long as you set the right encoding, Python will be able to determine which subset of specific sequences of bytes count as a-z or A-Z.

Josep M.
Re: [Tutor] A regular expression problem
Sorry, something went wrong and my message got sent before I could finish it. I'll try again.

On Tue, Nov 30, 2010 at 2:19 PM, Josep M. Fontana wrote:
> On Sun, Nov 28, 2010 at 6:03 PM, Evert Rol wrote:
>
>> -
>> with open('output_tokens.txt', 'a') as out_tokens:
>>     with open('text.txt', 'r') as in_tokens:
>>         t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
>>         output = t.tokenize(in_tokens.read())
>>         for item in output:
>>             out_tokens.write(" %s" % (item))
>
> I don't know for sure, but I would hazard a guess that you didn't specify
> unicode for the regular expression: character classes like \w and \s are
> dependent on your LOCALE settings.
> A flag like re.UNICODE could help, but I don't know if RegexpTokenizer
> accepts that.

OK, this must be the problem. The text is in ISO-8859-1, not in Unicode. I tried to fix the problem by doing the following:

-
import codecs

[...]

with codecs.open('output_tokens.txt', 'a', encoding='iso-8859-1') as out_tokens:
    with codecs.open('text.txt', 'r', encoding='iso-8859-1') as in_tokens:
        t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
        output = t.tokenize(in_tokens.read())
        for item in output:
            out_tokens.write(" %s" % (item))
---

Specifying that the encoding is 'iso-8859-1' didn't do anything, though. The output I get is still the same.

>> It would also appear that you could get a long way with the builtin re.split
>> function, and supply the flag inside that function; no need then for
>> RegexpTokenizer. Your tokenizer just appears to split on the tokens you
>> specify.

Yes. This is in fact what RegexpTokenizer seems to do. Here's what the little description of the class says:

"""
A tokenizer that splits a string into substrings using a regular
expression. The regular expression can be specified to match
either tokens or separators between tokens. Unlike C{re.findall()}
and C{re.split()}, C{RegexpTokenizer} does not treat regular
expressions that contain grouping parentheses specially.
"""

source: http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/tokenize/regexp.py?r=8539

Since I'm using the NLTK package and this module seemed to do what I needed, I thought I might as well use it. I thought (and I still do) that the problem I was having didn't have to do with the correct use of this module but with the way I constructed the regular expression. I wouldn't have asked the question here if I thought the problem had to do with this module.

If I understand correctly how re.split works, though, I don't think I would obtain the results I want. re.split would give me a list of the strings that occur around the pattern I specify as the first argument, right? But what I want is to match all the words that contain some non-alphanumeric character and exclude the rest of the words. Since these words are surrounded by spaces or line returns or a combination thereof, just like the other "normal" words, I can't think of any pattern for re.split() that would discriminate between the two types of strings. So I don't know how I would do what I want with re.split.

Josep M.
Re: [Tutor] A regular expression problem
Josep M. Fontana wrote:

> I'm trying to use regular expressions to extract strings that match
> certain patterns in a collection of texts. Basically these texts are
> edited versions of medieval manuscripts that use certain symbols to
> mark information that is useful for philologists. I'm interested in
> isolating words that have some non-alphanumeric symbol attached to the
> beginning or the end of the word or inserted in them. Here are some
> examples:
>
> '¿de', '«orden', '§Don', '·II·', 'que·l', 'Rey»'

Have you considered just using the isalnum() method?

>>> '¿de'.isalnum()
False

You will have to split your source text into individual words, then isolate those where word.isalnum() returns False.

> I'm using some modules from a package called NLTK but I think my
> problem is related to some misunderstanding of how regular
> expressions work.

The first thing to do is to isolate the cause of the problem. In your code below, you do four different things. In no particular order:

1 open and read an input file;
2 open and write an output file;
3 create a mysterious "RegexpTokenizer" object, whatever that is;
4 tokenize the input.

We can't run your code because:

1 we don't have access to your input file;
2 most of us don't have the NLTK package;
3 we don't know what RegexpTokenizer does;
4 we don't know what tokenizing does.

That makes it hard to solve the problem for you, although I'm willing to make a few wild guesses (see below). The most important debugging skill you can learn is to narrow the problem down to the smallest possible piece of code that gives you the wrong answer. This will help you solve the problem yourself, and it will also help others help you. Can you demonstrate the problem in a couple of lines of code that don't rely on external files, packages, or other code we don't have?

> Here's what I do. This was just a first attempt to get strings
> starting with a non-alphanumeric symbol. If this had worked, I would
> have continued to build the regular expression to get words with
> non-alphanumeric symbols in the middle and at the end. Alas, even this
> first attempt didn't work.
>
> -
> with open('output_tokens.txt', 'a') as out_tokens:
>     with open('text.txt', 'r') as in_tokens:
>         t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
>         output = t.tokenize(in_tokens.read())
>         for item in output:
>             out_tokens.write(" %s" % (item))

Firstly, it's best practice to write regexes as "raw strings" by preceding them with an r. Instead of

'[^a-zA-Z\s0-9]+\w+\S'

you should write:

r'[^a-zA-Z\s0-9]+\w+\S'

Notice that the r is part of the delimiter (r' and ') and not the contents. This instructs Python to ignore the special meaning of backslashes. In this specific case it won't make any difference, but it will make a big difference in other regexes.

Your regex says to match:

- one or more characters that aren't letters a...z (in either case), space or any digit (note that this is *not* the same as characters that aren't alphanumeric);

- followed by one or more alphanumeric characters;

- followed by exactly one character that is not whitespace.

I'm guessing the "not whitespace" is troublesome -- it will match characters like ¿ because it isn't whitespace.

I'd try this:

# untested
\b.*?\W.*?\b

which should match any word with a non-alphanumeric character in it:

- \b ... \b matches the start and end of the word;

- .*? matches zero or more characters (as few as possible);

- \W matches a single non-alphanumeric character.

So putting it all together, that should match a word with at least one non-alphanumeric character in it. (Caution: if you try this, you *must* use a raw string, otherwise you will get completely wrong results.)

> What puzzles me is that I get some results that don't make much sense
> given the regular expression.

Well, I don't know how RegexpTokenizer is supposed to work, so anything I say will be guesswork :)

[...]

> If you notice, there are some words that have an accented character
> that get treated in a strange way: all the characters that don't have
> a tilde get deleted and the accented character behaves as if it were a
> non-alphanumeric symbol.

Your regex matches if the first character isn't a space, a digit, or a-zA-Z. Accented characters aren't a-z or A-Z, and therefore will match.

--
Steven
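The split-then-isalnum() idea can be sketched in a couple of lines (Python 3 shown, where strings are Unicode; the sample line is from the corpus excerpt):

```python
text = '§Comeza el ·VII· libro, que es de Oérsino las bístias.'

# Split on whitespace, then keep only the words that are not purely alphanumeric.
flagged = [w for w in text.split() if not w.isalnum()]

print(flagged)   # ['§Comeza', '·VII·', 'libro,', 'bístias.']
```

Note that isalnum() treats the accented 'é' in 'Oérsino' as a letter, so that word is correctly left alone; on the other hand 'libro,' and 'bístias.' are flagged because of ordinary trailing punctuation, which is the kind of noise mentioned elsewhere in the thread.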
Re: [Tutor] A regular expression problem
> Here's what I do. This was just a first attempt to get strings
> starting with a non-alphanumeric symbol. If this had worked, I would
> have continued to build the regular expression to get words with
> non-alphanumeric symbols in the middle and at the end. Alas, even this
> first attempt didn't work.
>
> -
> with open('output_tokens.txt', 'a') as out_tokens:
>     with open('text.txt', 'r') as in_tokens:
>         t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
>         output = t.tokenize(in_tokens.read())
>         for item in output:
>             out_tokens.write(" %s" % (item))
>
> What puzzles me is that I get some results that don't make much sense
> given the regular expression. Here's some excerpt from the text I'm
> processing:
>
> ---
> "
>
> %Pág. 87
> &L-[LIBRO VII. DE OÉRSINO]&L+ &//
> §Comeza el ·VII· libro, que es de Oérsino las bístias. &//
> §Canto Félix ha tomado prenda del phisoloffo, el […] ·II· hómnes, e ellos"
> ---
>
> Here's the relevant part of the output file ('output_tokens.txt'):
>
> --
> " §Comenza ·VII· ístias. §Canto élix ·II· ómnes"
> ---
>
> If you notice, there are some words that have an accented character
> that get treated in a strange way: all the characters that don't have
> a tilde get deleted and the accented character behaves as if it were a
> non-alphanumeric symbol.
>
> What is going on? What am I doing wrong?

I don't know for sure, but I would hazard a guess that you didn't specify unicode for the regular expression: character classes like \w and \s are dependent on your LOCALE settings. A flag like re.UNICODE could help, but I don't know if RegexpTokenizer accepts that.

It would also appear that you could get a long way with the builtin re.split function, and supply the flag inside that function; no need then for RegexpTokenizer. Your tokenizer just appears to split on the tokens you specify.

Lastly, an output convenience:

out_tokens.write(' '.join(list(output)))

instead of the for-loop. (I'm casting output to a list here, since I don't know whether output is a list or an iterator.)

Let us know if UNICODE (or other LOCALE settings) solves your problem.

Cheers,

Evert
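To illustrate the decoding point: translate the bytes into characters first, then run the regex on the resulting Unicode text. A small Python 3 sketch (in Python 2 you would pass re.UNICODE explicitly; the byte string below is just a corpus fragment encoded in ISO-8859-1 for the example):

```python
import re

raw = b'\xa7Comeza el \xb7VII\xb7 libro'   # '§Comeza el ·VII· libro' in ISO-8859-1
text = raw.decode('iso-8859-1')            # bytes -> characters, before any regex work

# Once the text is Unicode, splitting on whitespace recovers the words
# with accents and editorial symbols intact.
words = re.split(r'\s+', text)

print(words)   # ['§Comeza', 'el', '·VII·', 'libro']
```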
[Tutor] A regular expression problem
I'm trying to use regular expressions to extract strings that match certain patterns in a collection of texts. Basically these texts are edited versions of medieval manuscripts that use certain symbols to mark information that is useful for philologists. I'm interested in isolating words that have some non-alphanumeric symbol attached to the beginning or the end of the word or inserted in them. Here are some examples:

'¿de', '«orden', '§Don', '·II·', 'que·l', 'Rey»'

I'm using some modules from a package called NLTK but I think my problem is related to some misunderstanding of how regular expressions work.

Here's what I do. This was just a first attempt to get strings starting with a non-alphanumeric symbol. If this had worked, I would have continued to build the regular expression to get words with non-alphanumeric symbols in the middle and at the end. Alas, even this first attempt didn't work.

-
with open('output_tokens.txt', 'a') as out_tokens:
    with open('text.txt', 'r') as in_tokens:
        t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
        output = t.tokenize(in_tokens.read())
        for item in output:
            out_tokens.write(" %s" % (item))

What puzzles me is that I get some results that don't make much sense given the regular expression. Here's some excerpt from the text I'm processing:

---
"

%Pág. 87
&L-[LIBRO VII. DE OÉRSINO]&L+ &//
§Comeza el ·VII· libro, que es de Oérsino las bístias. &//
§Canto Félix ha tomado prenda del phisoloffo, el […] ·II· hómnes, e ellos"
---

Here's the relevant part of the output file ('output_tokens.txt'):

--
" §Comenza ·VII· ístias. §Canto élix ·II· ómnes"
---

If you notice, there are some words that have an accented character that get treated in a strange way: all the characters that don't have a tilde get deleted and the accented character behaves as if it were a non-alphanumeric symbol.

What is going on? What am I doing wrong?

Josep M.