Re: [Tutor] A regular expression problem

2010-12-01 Thread Steven D'Aprano

Josep M. Fontana wrote:
[...]

I guess this is because the character encoding was not specified but
accented characters in the languages I'm dealing with should be
treated as a-z or A-Z, shouldn't they? 


No. a-z means a-z. If you want the localized set of alphanumeric 
characters, you need \w.


Likewise 0-9 means 0-9. If you want localized digits, you need \d.
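
A quick illustration with the plain re module (Python 3 assumed, where
strings are Unicode; under Python 2 you would also need u'' literals and
the re.UNICODE flag):

import re
print(re.findall(r'[a-zA-Z]+', 'École'))        # ['cole']  -- É is not in a-zA-Z
print(re.findall(r'\w+', 'École', re.UNICODE))  # ['École'] -- \w includes É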


I mean, how do you deal with

languages that are not English with regular expressions? I would
assume that as long as you set the right encoding, Python will be able
to determine which subset of specific sequences of bytes count as a-z
or A-Z.


Encodings have nothing to do with this issue.

Literal characters a, b, ..., z etc. always have ONE meaning: they 
represent themselves (although possibly in a case-insensitive fashion). 
E means E, not È, É, Ê or Ë.


Localization tells the regex how to interpret special patterns like \d 
and \w. This has nothing to do with encodings -- by the time the regex 
sees the string, it is already dealing with characters. Localization is 
about what characters are in categories ("is 5 a digit or a letter? how 
about ٣ ?").


Encoding is used to translate between bytes on disk and characters. For 
example, the character Ë could be stored on disk as the hex bytes:


\xcb  # one byte (e.g. Latin-1 / ISO-8859-1)
\xc3\x8b  # two bytes (UTF-8)
\xff\xfe\xcb\x00  # four bytes (UTF-16 with a little-endian byte order mark)

and more, depending on the encoding used.
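
In Python you can see this directly (Python 3 syntax; the exact four-byte
form depends on whether a byte order mark is written and in which order):

ch = '\u00cb'                # the character Ë
print(ch.encode('latin-1'))  # b'\xcb'
print(ch.encode('utf-8'))    # b'\xc3\x8b'
print(ch.encode('utf-16'))   # b'\xff\xfe\xcb\x00' on a little-endian platform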


--
Steven
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] A regular expression problem

2010-11-30 Thread Josep M. Fontana
On Sun, Nov 28, 2010 at 6:14 PM, Steven D'Aprano  wrote:

> Have you considered just using the isalnum() method?
>
> >>> '¿de'.isalnum()
> False

Mmm. No, I didn't consider it because I didn't even know such a method
existed. This can turn out to be very handy but I don't think it would
help me at this stage because the texts I'm working with contain also
a lot of non alpha-numeric characters that occur in isolation. So I
would get a lot of noise.

> The first thing to do is to isolate the cause of the problem. In your code
> below, you do four different things. In no particular order:
>
> 1 open and read an input file;
> 2 open and write an output file;
> 3 create a mysterious "RegexpTokenizer" object, whatever that is;
> 4 tokenize the input.
>
> We can't run your code because:
>
> 1 we don't have access to your input file;
> 2 most of us don't have the NLTK package;
> 3 we don't know what RegexpTokenizer does;
> 4 we don't know what tokenizing does.

As I said in my answer to Evert, I assumed the problem I was having
had to do exclusively with the regular expression pattern I was using.
The code for RegexpTokenizer seems to be pretty simple
(http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/tokenize/regexp.py?r=8539)
and all it does is:

"""
Tokenizers that divide strings into substrings using regular
expressions that can match either tokens or separators between tokens.
"""



> you should write:
>
> r'[^a-zA-Z\s0-9]+\w+\S'

Now you can understand why I didn't use r'...'. The methods in the module
already use this internally, and I just need to insert the regular
expression as the argument.


> Your regex says to match:
>
> - one or more characters that aren't letters a...z (in either
>  case), space or any digit (note that this is *not* the same as
>  characters that aren't alphanum);
>
> - followed by one or more alphanum character;
>
> - followed by exactly one character that is not whitespace.
>
> I'm guessing the "not whitespace" is troublesome -- it will match characters
> like ¿ because it isn't whitespace.


This was my first attempt to match strings like:

'&patre--' or '&patre'

The "not whitespace" was intended to match the occurrence of
non-alphanumeric characters appearing after "regular" characters. I
realize I should have added '*' after '\S', since I also want to match
words that do not have a non-alphanumeric symbol at the end (i.e.
'&patre' as opposed to '&patre--').
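
Just to check the amended pattern outside NLTK (a rough sketch with the
plain re module, which is not necessarily how RegexpTokenizer applies it):

import re
pattern = re.compile(r'[^a-zA-Z\s0-9]+\w+\S*')
print(pattern.findall('&patre-- y &patre'))   # ['&patre--', '&patre']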

>
> I'd try this:
>
> # untested
> \b.*?\W.*?\b
>
> which should match any word with a non-alphanumeric character in it:
>
> - \b ... \b matches the start and end of the word;
>
> - .*? matches zero or more characters (as few as possible);
>
> - \W matches a single non-alphanumeric character.
>
> So putting it all together, that should match a word with at least one
> non-alphanumeric character in it.


But since '.' matches any character except a newline, this would
also yield strings where all the characters are non-alphanumeric. I
should have said this in my initial message, but the texts I'm working
with contain lots of strings that are just sequences of non-alphanumeric
characters (e.g. '&%+' or '&//'). I'm trying to match only strings
that are a mixture of both non-alphanumeric characters and [a-zA-Z].
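
One way I can think of to express that requirement is to demand at least
one word character and at least one character that is neither a word
character nor whitespace (just a sketch with the plain re module, Python 3
assumed so that \w also covers accented letters):

import re
words = ['¿de', '«orden', '§Don', 'Rey»', '&%+', 'palabra']
mixed = [w for w in words
         if re.search(r'\w', w, re.UNICODE) and re.search(r'[^\w\s]', w, re.UNICODE)]
print(mixed)   # ['¿de', '«orden', '§Don', 'Rey»'] -- '&%+' and 'palabra' are dropped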

> [...]
>>
>> If you notice, there are some words that have an accented character
>> that get treated in a strange way: all the characters that don't have
>> a tilde get deleted and the accented character behaves as if it were a
>> non alpha-numeric symbol.
>
> Your regex matches if the first character isn't a space, a digit, or
> a-zA-Z. Accented characters aren't a-z or A-Z, and therefore will match.

I guess this is because the character encoding was not specified but
accented characters in the languages I'm dealing with should be
treated as a-z or A-Z, shouldn't they? I mean, how do you deal with
languages that are not English with regular expressions? I would
assume that as long as you set the right encoding, Python will be able
to determine which subset of specific sequences of bytes count as a-z
or A-Z.

Josep M.
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] A regular expression problem

2010-11-30 Thread Josep M. Fontana
Sorry, something went wrong and my message got sent before I could
finish it. I'll try again.

On Tue, Nov 30, 2010 at 2:19 PM, Josep M. Fontana wrote:
> On Sun, Nov 28, 2010 at 6:03 PM, Evert Rol  wrote:
> 
 
>> -
>> with open('output_tokens.txt', 'a') as out_tokens:
>>     with open('text.txt', 'r') as in_tokens:
>>         t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
>>         output = t.tokenize(in_tokens.read())
>>         for item in output:
>>             out_tokens.write(" %s" % (item))
>
> I don't know for sure, but I would hazard a guess that you didn't specify 
> unicode for the regular expression: character classes like \w and \s are 
> dependent on your LOCALE settings.
> A flag like re.UNICODE could help, but I don't know if Regexptokenizer 
> accepts that.

 OK, this must be the problem. The text is in ISO-8859-1 not in
Unicode. I tried to fix the problem by doing the following:

-
import codecs
[...]
with codecs.open('output_tokens.txt', 'a', encoding='iso-8859-1') as out_tokens:
    with codecs.open('text.txt', 'r', encoding='iso-8859-1') as in_tokens:
        t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
        output = t.tokenize(in_tokens.read())
        for item in output:
            out_tokens.write(" %s" % (item))

---

Specifying that the encoding is 'iso-8859-1' didn't do anything,
though. The output I get is still the same.
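
For what it's worth, I get the same kind of tokens if I run the pattern
directly through the plain re module on an already decoded string (a
sketch, outside NLTK, Python 3 syntax):

import re
line = '§Canto Félix ha tomado prenda'
print(re.findall(r'[^a-zA-Z\s0-9]+\w+\S*', line))   # ['§Canto', 'élix']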

>> It would also appear that you could get a long way with the builtin re.split
>> function, and supply the flag inside that function; no need then for
>> RegexpTokenizer. Your tokenizer just appears to split on the tokens you
>> specify.

Yes. This is in fact what RegexpTokenizer seems to do. Here's what the
little description of the class says:

"""
A tokenizer that splits a string into substrings using a regular
expression.  The regular expression can be specified to match
either tokens or separators between tokens.

Unlike C{re.findall()} and C{re.split()}, C{RegexpTokenizer} does
not treat regular expressions that contain grouping parenthases
specially.
"""

source:
http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/tokenize/regexp.py?r=8539
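
The grouping behaviour that the docstring alludes to is easy to see with
re.split itself (plain re module, Python 3 assumed):

import re
print(re.split(r'\s+', 'a b c'))     # ['a', 'b', 'c']
print(re.split(r'(\s+)', 'a b c'))   # ['a', ' ', 'b', ' ', 'c'] -- the group is kept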

Since I'm using the NLTK package and this module seemed to do what I
needed, I thought I might as well use it. I thought (and I still do)
that the problem I was having didn't have to do with the correct use
of this module but with the way I constructed the regular expression.
I wouldn't have asked the question here if I thought that the problem
had to do with this module.

If I understand correctly how re.split works, though, I don't
think I would obtain the results I want.

re.split would allow me to get a list of the strings that occur around
the pattern I specify as the first argument in the function, right?
But what I want is to match all the words that contain some non
alpha-numeric character in them and exclude the rest of the words.
Since these words are surrounded by spaces or by line returns or a
combination thereof, just as the other "normal" words, I can't think
of any pattern that I can use in re.split() that would discriminate
between the two types of strings. So I don't know how I would do what
I want with re.split.
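
Just to illustrate what I mean (the pattern here is only a sketch of
"token containing a non-word character", not my final expression):

import re
text = '§Comeza el ·VII· libro'
print(re.split(r'\S*[^\w\s]\S*', text))   # ['', ' el ', ' libro']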

Josep M.
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] A regular expression problem

2010-11-28 Thread Steven D'Aprano

Josep M. Fontana wrote:

I'm trying to use regular expressions to extract strings that match
certain patterns in a collection of texts. Basically these texts are
edited versions of medieval manuscripts that use certain symbols to
mark information that is useful for philologists.

I'm interested in isolating words that have some non alpha-numeric
symbol attached to the beginning or the end of the word or inserted in
them. Here are some examples:

'¿de' ,'«orden', '§Don', '·II·', 'que·l', 'Rey»'


Have you considered just using the isalnum() method?

>>> '¿de'.isalnum()
False

You will have to split your source text into individual words, then 
isolate those where word.isalnum() returns False.
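
For example (Python 3, or Python 2 unicode strings, assumed, so that
accented letters count as alphanumeric):

text = '§Comeza el ·VII· libro que es de Oérsino las bístias'
print([w for w in text.split() if not w.isalnum()])   # ['§Comeza', '·VII·']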




I'm using some modules from a package called NLTK but I think my
problem is related to some misunderstanding of how regular expressions
work.


The first thing to do is to isolate the cause of the problem. In your 
code below, you do four different things. In no particular order:


1 open and read an input file;
2 open and write an output file;
3 create a mysterious "RegexpTokenizer" object, whatever that is;
4 tokenize the input.

We can't run your code because:

1 we don't have access to your input file;
2 most of us don't have the NLTK package;
3 we don't know what RegexpTokenizer does;
4 we don't know what tokenizing does.

Makes it hard to solve the problem for you, although I'm willing to make 
a few wild guesses (see below).


The most important debugging skill you can learn is to narrow the 
problem down to the smallest possible piece of code that gives you the 
wrong answer. This will help you solve the problem yourself, and it will 
also help others help you. Can you demonstrate the problem in a couple 
of lines of code that doesn't rely on external files, packages, or other 
code we don't have?




Here's what I do. This was just a first attempt to get strings
starting with a non alpha-numeric symbol. If this had worked, I would
have continued to build the regular expression to get words with non
alpha-numeric symbols in the middle and in the end. Alas, even this
first attempt didn't work.

-
with open('output_tokens.txt', 'a') as out_tokens:
    with open('text.txt', 'r') as in_tokens:
        t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
        output = t.tokenize(in_tokens.read())
        for item in output:
            out_tokens.write(" %s" % (item))


Firstly, it's best practice to write regexes as "raw strings" by 
preceding them with an r. Instead of


'[^a-zA-Z\s0-9]+\w+\S'

you should write:

r'[^a-zA-Z\s0-9]+\w+\S'

Notice that the r is part of the delimiter (r' and ') and not the 
contents. This instructs Python to ignore the special meaning of 
backslashes. In this specific case, it won't make any difference, but it 
will make a big difference in other regexes.
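
A case where it does matter is \b: in an ordinary string literal \b is a
backspace character, not the regex word-boundary token (Python 3 syntax):

import re
print(len('\b'), len(r'\b'))            # 1 2
print(re.findall('\bcat\b', 'a cat'))   # []      -- looks for backspace characters
print(re.findall(r'\bcat\b', 'a cat'))  # ['cat'] -- word boundaries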


Your regex says to match:

- one or more characters that aren't letters a...z (in either
  case), space or any digit (note that this is *not* the same as
  characters that aren't alphanum);

- followed by one or more alphanum character;

- followed by exactly one character that is not whitespace.

I'm guessing the "not whitespace" is troublesome -- it will match 
characters like ¿ because it isn't whitespace.



I'd try this:

# untested
\b.*?\W.*?\b

which should match any word with a non-alphanumeric character in it:

- \b ... \b matches the start and end of the word;

- .*? matches zero or more characters (as few as possible);

- \W matches a single non-alphanumeric character.

So putting it all together, that should match a word with at least one 
non-alphanumeric character in it.


(Caution: if you try this, you *must* use a raw string, otherwise you 
will get completely wrong results.)




What puzzles me is that I get some results that don't make much sense
given the regular expression.


Well, I don't know how RegexpTokenizer is supposed to work, so anything I 
say will be guesswork :)



[...]

If you notice, there are some words that have an accented character
that get treated in a strange way: all the characters that don't have
a tilde get deleted and the accented character behaves as if it were a
non alpha-numeric symbol.


Your regex matches if the first character isn't a space, a digit, or
a-zA-Z. Accented characters aren't a-z or A-Z, and therefore will match.
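
You can see it with the pattern on its own (a sketch with the plain re
module, Python 3 assumed):

import re
print(re.findall(r'[^a-zA-Z\s0-9]+\w+\S', 'hómnes'))   # ['ómnes']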




--
Steven

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] A regular expression problem

2010-11-28 Thread Evert Rol


> Here's what I do. This was just a first attempt to get strings
> starting with a non alpha-numeric symbol. If this had worked, I would
> have continued to build the regular expression to get words with non
> alpha-numeric symbols in the middle and in the end. Alas, even this
> first attempt didn't work.
> 
> -
> with open('output_tokens.txt', 'a') as out_tokens:
>     with open('text.txt', 'r') as in_tokens:
>         t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
>         output = t.tokenize(in_tokens.read())
>         for item in output:
>             out_tokens.write(" %s" % (item))
> 
> 
> 
> What puzzles me is that I get some results that don't make much sense
> given the regular expression. Here's some excerpt from the text I'm
> processing:
> 
> ---
> "
> 
> %Pág. 87
> &L-[LIBRO VII. DE OÉRSINO]&L+ &//
> §Comeza el ·VII· libro, que es de Oérsino las bístias. &//
> §Canto Félix ha tomado prenda del phisoloffo, el […] ·II· hómnes, e ellos"
> 
> 
> 
> Here's the relevant part of the output file ('output_tokens.txt'):
> 
> --
> "  §Comenza ·VII· ístias. §Canto élix ·II· ómnes"
> ---
> 
> If you notice, there are some words that have an accented character
> that get treated in a strange way: all the characters that don't have
> a tilde get deleted and the accented character behaves as if it were a
> non alpha-numeric symbol.
> 
> What is going on? What am I doing wrong?


I don't know for sure, but I would hazard a guess that you didn't specify 
unicode for the regular expression: character classes like \w and \s are 
dependent on your LOCALE settings. 
A flag like re.UNICODE could help, but I don't know if Regexptokenizer accepts 
that.
It would also appear that you could get a long way with the builtin re.split
function, and supply the flag inside that function; no need then for
RegexpTokenizer. Your tokenizer just appears to split on the tokens you specify.

Lastly, an output convenience:
out_tokens.write(' '.join(list(output)))
instead of the for-loop.
(I'm casting output to a list here, since I don't know whether output is a list 
or an iterator.)

Let us know if UNICODE (or other LOCALE settings) can solve your problem.

Cheers,

  Evert


___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


[Tutor] A regular expression problem

2010-11-28 Thread Josep M. Fontana
I'm trying to use regular expressions to extract strings that match
certain patterns in a collection of texts. Basically these texts are
edited versions of medieval manuscripts that use certain symbols to
mark information that is useful for philologists.

I'm interested in isolating words that have some non alpha-numeric
symbol attached to the beginning or the end of the word or inserted in
them. Here are some examples:

'¿de' ,'«orden', '§Don', '·II·', 'que·l', 'Rey»'

I'm using some modules from a package called NLTK but I think my
problem is related to some misunderstanding of how regular expressions
work.

Here's what I do. This was just a first attempt to get strings
starting with a non alpha-numeric symbol. If this had worked, I would
have continued to build the regular expression to get words with non
alpha-numeric symbols in the middle and in the end. Alas, even this
first attempt didn't work.

-
with open('output_tokens.txt', 'a') as out_tokens:
    with open('text.txt', 'r') as in_tokens:
        t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
        output = t.tokenize(in_tokens.read())
        for item in output:
            out_tokens.write(" %s" % (item))



What puzzles me is that I get some results that don't make much sense
given the regular expression. Here's some excerpt from the text I'm
processing:

---
"

%Pág. 87
&L-[LIBRO VII. DE OÉRSINO]&L+ &//
§Comeza el ·VII· libro, que es de Oérsino las bístias. &//
 §Canto Félix ha tomado prenda del phisoloffo, el […] ·II· hómnes, e ellos"



Here's the relevant part of the output file ('output_tokens.txt'):

--
 " http://mail.python.org/mailman/listinfo/tutor