Sorry, something went wrong and my message got sent before I could
finish it. I'll try again.

On Tue, Nov 30, 2010 at 2:19 PM, Josep M. Fontana
<josep.m.font...@gmail.com> wrote:
> On Sun, Nov 28, 2010 at 6:03 PM, Evert Rol <evert....@gmail.com> wrote:
> <snip intro>
 <snip>
>> ---------
>> with open('output_tokens.txt', 'a') as out_tokens:
>>    with open('text.txt', 'r') as in_tokens:
>>        t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
>>        output = t.tokenize(in_tokens.read())
>>        for item in output:
>>            out_tokens.write(" %s" % (item))
>
> I don't know for sure, but I would hazard a guess that you didn't specify 
> unicode for the regular expression: character classes like \w and \s are 
> dependent on your LOCALE settings.
> A flag like re.UNICODE could help, but I don't know if Regexptokenizer 
> accepts that.

OK, this must be the problem. The text is in ISO-8859-1, not Unicode.
I tried to fix the problem by doing the following:

-------------
import codecs
[...]
with codecs.open('output_tokens.txt', 'a', encoding='iso-8859-1') as out_tokens:
    with codecs.open('text.txt', 'r',  encoding='iso-8859-1') as in_tokens:
        t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
        output = t.tokenize(in_tokens.read())
        for item in output:
            out_tokens.write(" %s" % (item))

-------------------

Specifying that the encoding is 'iso-8859-1' didn't do anything,
though. The output I get is still the same.
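
Maybe the flag is what matters rather than the file encoding. As a
quick check (untested sketch, not using the NLTK tokenizer at all), I
suppose I could decode the file and run the same pattern through plain
re with re.UNICODE, so that \w and \s also cover the accented
characters:

-------------
import codecs
import re

# Read the file as decoded unicode text.
with codecs.open('text.txt', 'r', encoding='iso-8859-1') as in_tokens:
    text = in_tokens.read()

# Same pattern as above, but compiled with re.UNICODE so that \w and \s
# are not restricted to ASCII/LOCALE.
pattern = re.compile(r'[^a-zA-Z\s0-9]+\w+\S', re.UNICODE)
for token in pattern.findall(text):
    print(repr(token))
-------------

If that prints the accented words whole, then the missing flag, and not
the file encoding, is the real culprit.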

>> It would also appear that you could get a long way with the builtin re.split
>> function, and supply the flag inside that function; no need then for
>> Regexptokenizer. Your tokenizer just appears to split on the tokens you
>> specify.

Yes, this is in fact what RegexpTokenizer seems to do. Here's what the
class's docstring says:

"""
    A tokenizer that splits a string into substrings using a regular
    expression.  The regular expression can be specified to match
    either tokens or separators between tokens.

    Unlike C{re.findall()} and C{re.split()}, C{RegexpTokenizer} does
    not treat regular expressions that contain grouping parenthases
    specially.
    """

source:
http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/tokenize/regexp.py?r=8539
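
If I read that source right, the same class can work either way: by
default the pattern describes the tokens themselves, and with gaps=True
it describes the separators, which is essentially what re.split() does.
A small sketch of the difference (untested, and the keyword might differ
between NLTK versions):

-------------
from nltk.tokenize import RegexpTokenizer

sample = "words like d'una or s'escriu, mixed with plain ones"

# Default mode: the pattern describes the tokens to keep
# (here, anything containing an apostrophe).
matcher = RegexpTokenizer(r"\S*'\S*")
print(matcher.tokenize(sample))

# gaps=True: the pattern describes the separators instead,
# so this behaves like re.split() on whitespace.
splitter = RegexpTokenizer(r'\s+', gaps=True)
print(splitter.tokenize(sample))
-------------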

Since I'm using the NLTK package and this module seemed to do what I
needed, I thought I might as well use it. I thought (and I still think)
that the problem I was having had to do not with the correct use of
this module but with the way I constructed the regular expression. I
wouldn't have asked the question here if I thought that the problem had
to do with this module.

If I understand correctly how re.split works, though, I don't think I
would obtain the results I want.

re.split would allow me to get a list of the strings that occur around
the pattern I specify as the first argument in the function, right?
But what I want is to match all the words that contain some
non-alphanumeric character in them and exclude the rest of the words.
Since these words are surrounded by spaces or by line returns or a
combination thereof, just as the other "normal" words, I can't think
of any pattern that I can use in re.split() that would discriminate
between the two types of strings. So I don't know how I would do what
I want with re.split.
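
What might work better is to turn it around and use re.findall to
describe the words I do want to keep: any run of non-space characters
that contains at least one character that is neither alphanumeric nor
whitespace. Something along these lines (untested sketch; the sample
string is just made up):

-------------
import re

# Made-up sample with a couple of "words" containing non-alphanumeric
# characters (apostrophe, comma) among plain words.
sample = u"una frase amb d'una i qualsevol, entre paraules normals"

# \S*[^\w\s]\S* : non-space characters containing at least one character
# that is neither alphanumeric (\w) nor whitespace (\s).
pattern = re.compile(r'\S*[^\w\s]\S*', re.UNICODE)
print(pattern.findall(sample))
-------------

But maybe I'm missing a simpler way to do the same thing with re.split.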

Josep M.