On Sun, Nov 28, 2010 at 6:14 PM, Steven D'Aprano <st...@pearwood.info> wrote:
<snip>
> Have you considered just using the isalnum() method?
>
> >>> '¿de'.isalnum()
> False
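A quick sketch of what isalnum() reports on a few tokens, assuming Python 3 (where str is Unicode). The tokens are made up for illustration; they are not from the original input file.

```python
# Hypothetical sample tokens, not taken from the original texts.
tokens = ['de', '¿de', 'más', '&patre--', '&%+']

# str.isalnum() is True only when the string is non-empty and every
# character in it is a letter or a digit.
results = {tok: tok.isalnum() for tok in tokens}

for tok, flag in results.items():
    print(repr(tok), flag)
```

Note that a token made only of symbols ('&%+') and a mixed token ('&patre--') both come back False, so isalnum() on its own cannot tell the two apart.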
Mmm. No, I didn't consider it because I didn't even know such a method existed. It could turn out to be very handy, but I don't think it would help me at this stage because the texts I'm working with also contain a lot of non-alphanumeric characters that occur in isolation, so I would get a lot of noise.

> The first thing to do is to isolate the cause of the problem. In your code
> below, you do four different things. In no particular order:
>
> 1 open and read an input file;
> 2 open and write an output file;
> 3 create a mysterious "RegexpTokenizer" object, whatever that is;
> 4 tokenize the input.
>
> We can't run your code because:
>
> 1 we don't have access to your input file;
> 2 most of us don't have the NLTK package;
> 3 we don't know what RegexTokenizer does;
> 4 we don't know what tokenizing does.

As I said in my answer to Evert, I assumed the problem I was having had to do exclusively with the regular expression pattern I was using. The code for RegexpTokenizer seems to be pretty simple (http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/tokenize/regexp.py?r=8539) and all it does is:

"""
Tokenizers that divide strings into substrings using regular
expressions that can match either tokens or separators between tokens.
"""

<snip>

> you should write:
>
> r'[^a-zA-Z\s0-9]+\w+\S'

Now you can understand why I didn't use r' '. The methods in the module already use this internally and I just need to insert the regular expression as the argument.

> Your regex says to match:
>
> - one or more characters that aren't letters a...z (in either
> case), space or any digit (note that this is *not* the same as
> characters that aren't alphanum);
>
> - followed by one or more alphanum character;
>
> - followed by exactly one character that is not whitespace.
>
> I'm guessing the "not whitespace" is troublesome -- it will match characters
> like ¿ because it isn't whitespace.
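To see what that pattern does in practice, here is a minimal sketch, assuming Python 3 (where str patterns are Unicode-aware; byte strings in Python 2 would behave differently). The sample tokens are hypothetical and not taken from the original input file.

```python
import re

# The pattern discussed in the thread, written as a raw string.
pattern = re.compile(r'[^a-zA-Z\s0-9]+\w+\S')

# Hypothetical sample tokens, not from the original input file.
samples = ['&patre--', '&patre', '¿de', 'públicos', '&%+']

# Map each sample to the list of substrings the pattern finds in it.
matches = {s: pattern.findall(s) for s in samples}

for s, found in matches.items():
    print(repr(s), '->', found)
```

Under those assumptions, 'públicos' yields ['úblicos']: the match can only start at 'ú', because that is the first character outside a-z, A-Z, digits and whitespace, so the leading 'p' is lost. Also, '&patre--' yields ['&patre-'], because \S consumes exactly one trailing character.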
This was my first attempt to match strings like:

'&patre--' or '&patre'

The "not whitespace" was intended to match the occurrence of non-alphanumeric characters appearing after "regular" characters. I realize I should have added '*' after '\S', since I also want to match words that do not have a non-alphanumeric symbol at the end (i.e. '&patre' as opposed to '&patre--').

> I'd try this:
>
> # untested
> \b.*?\W.*?\b
>
> which should match any word with a non-alphanumeric character in it:
>
> - \b ... \b matches the start and end of the word;
>
> - .*? matches zero or more characters (as few as possible);
>
> - \W matches a single non-alphanumeric character.
>
> So putting it all together, that should match a word with at least one
> non-alphanumeric character in it.

But since '.' matches any character except a newline, this would also yield strings where all the characters are non-alphanumeric. I should have said this in my initial message, but the texts I'm working with contain lots of strings made up entirely of non-alphanumeric characters (e.g. '&%+' or '&//'). I'm trying to match only strings that mix non-alphanumeric characters with [a-zA-Z].

> [...]
>>
>> If you notice, there are some words that have an accented character
>> that get treated in a strange way: all the characters that don't have
>> a tilde get deleted and the accented character behaves as if it were a
>> non alpha-numeric symbol.
>
> Your regex matches if the first character isn't a space, a digit, or a
> a-zA-Z. Accented characters aren't a-z or A-Z, and therefore will match.

I guess this is because the character encoding was not specified, but accented characters in the languages I'm dealing with should be treated as a-z or A-Z, shouldn't they? I mean, how do you deal with languages that are not English with regular expressions?
I would assume that as long as you set the right encoding, Python will be able to determine which subset of specific sequences of bytes count as a-z or A-Z.

Josep M.
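For what it's worth, here is a sketch of one way to pick out only the mixed tokens, again assuming Python 3. One caveat on the encoding idea: a class like [a-zA-Z] is a literal range of ASCII code points, so no encoding setting makes it include accented letters, whereas \w on Unicode strings does include them. The pattern, the name MIXED and the sample text are illustrative assumptions, not something from NLTK or from the original data.

```python
import re

# Assumed approach (not from the thread): two lookaheads require that the
# whitespace-delimited token starting here contains
#   - at least one letter-like character ([^\W\d_] is \w minus digits and '_')
#   - at least one character that is neither a word character nor whitespace
# and \S+ then consumes the whole token.
MIXED = re.compile(r'(?=\S*[^\W\d_])(?=\S*[^\w\s])\S+')

# Hypothetical sample text, not from the original input file.
text = '&patre-- públicos &%+ ¿de más &// año2'

mixed_tokens = MIXED.findall(text)
print(mixed_tokens)
```

With this sample, only '&patre--' and '¿de' are returned: plain words such as 'públicos' and symbol-only runs such as '&%+' are both skipped. I have not tried this pattern inside RegexpTokenizer itself.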