On Tue, Feb 2, 2010 at 11:36 PM, Kent Johnson <ken...@tds.net> wrote:
> On Tue, Feb 2, 2010 at 4:56 PM, Norman Khine <nor...@khine.net> wrote:
>> On Tue, Feb 2, 2010 at 10:11 PM, Kent Johnson <ken...@tds.net> wrote:
>
>>> Try this version:
>>>
>>> data = file.read()
>>>
>>> get_records = re.compile(r"""openInfoWindowHtml\(.*?\ticon: myIcon\n""", re.DOTALL).findall
>>> get_titles = re.compile(r"""<strong>(.*)<\/strong>""").findall
>>> get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall
>>> get_latlngs = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)""").findall
>>>
>>> then as before.
>>>
>>> Your repr() call is essentially removing newlines from the input by
>>> converting them to literal '\n' pairs. This allows your regex to work
>>> without the DOTALL modifier.
>>>
>>> Note you will get slightly different results with my version - it will
>>> give you correct utf-8 text for the titles whereas yours gives \
>>> escapes. For example one of the titles is "CGTSM (Satére Mawé)". Your
>>> version returns
>>>
>>> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804', '-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'}
>>>
>>> Mine gives
>>> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804', '-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'}
>>>
>>> This is showing the repr() of the title so they both have \ but note
>>> that yours has two \\ indicating that the \ is in the text; mine has
>>> only one \.
>>
>> i am no expert, but there seems to be a bigger difference.
>>
>> with repr(), i get:
>> Sat\\xe9re Maw\\xe9
>>
>> where as you get
>>
>> Sat\xc3\xa9re Maw\xc3\xa9
>>
>> repr()'s
>> é == \\xe9
>> whereas on your version
>> é == \xc3\xa9
>
> Right. Your version has four actual characters in the result - \, x,
> e, 9. This is the escaped representation of the unicode representation
> of e-acute. (The \ is doubled in the repr display.)
>
> My version has two bytes in the result, with the values c3 and a9.
> This is the utf-8 representation of e-acute.
>
> If you want to accurately represent (i.e. print) the title at some
> later time you probably want the utf-8 representation.
>>
>>>
>>> Kent
>>>
>>
>> also, i still get an empty list when i run the code as suggested.
>
> You didn't change the regexes. You have to change \\t and \\n to \t
> and \n because the source text now has actual tabs and newlines, not
> the escaped representations.
>
> I know this is confusing, I'm sorry I don't have time or patience to
> explain more.
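
For anyone following along, here is a minimal sketch of why the old pattern returns an empty list on the raw text. The sample string is a made-up stand-in for the real page source, and the variable names (raw_pattern, escaped_pattern and so on) are only for illustration, not from the thread:

```python
import re

# Hypothetical sample standing in for the real page source (not from the thread).
data = 'openInfoWindowHtml("<strong>Title</strong>",\n{\ticon: myIcon\n'

# On the raw text, \t and \n in the pattern match a real tab and a real newline,
# and DOTALL lets .*? cross the newline inside the record.
raw_pattern = re.compile(r"openInfoWindowHtml\(.*?\ticon: myIcon\n", re.DOTALL)

# On repr(data), each tab/newline has become a backslash plus a letter, so the
# pattern has to match a literal backslash (\\) followed by t or n.
escaped_pattern = re.compile(r"openInfoWindowHtml\(.*?\\ticon: myIcon\\n")

raw_hits = raw_pattern.findall(data)            # new pattern, raw text
escaped_hits = escaped_pattern.findall(repr(data))  # old pattern, repr() text
mismatch_hits = escaped_pattern.findall(data)   # old pattern, raw text

print(len(raw_hits), len(escaped_hits), len(mismatch_hits))
```

The last case is the one that bit me: the \\t and \\n pattern looks for backslash characters that are simply not present once repr() is dropped.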
thanks for your time, i did realise after i posted the email that the regex needed to be changed.

> Kent
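
To make the \\xe9 versus \xc3\xa9 difference concrete, here is a rough sketch in Python 3 terms. This thread used Python 2 byte strings, so take it as an approximation: the unicode_escape codec is used here only as a stand-in for the escaping that repr() produced, and the variable names are mine:

```python
# The title from the thread, written with \u escapes so the source file
# encoding does not matter.
title = "CGTSM (Sat\u00e9re Maw\u00e9)"

# UTF-8: each e-acute becomes the two bytes 0xc3 0xa9.
utf8_bytes = title.encode("utf-8")

# Escaped form: each e-acute becomes the four ASCII characters \ x e 9.
escaped_bytes = title.encode("unicode_escape")

print(repr(utf8_bytes))
print(repr(escaped_bytes))

# Both forms can be turned back into the original text, but only with the
# matching decoder.
assert utf8_bytes.decode("utf-8") == title
assert escaped_bytes.decode("unicode_escape") == title
```

The escaped form is plain ASCII text that merely describes the character, while the utf-8 form is the character's actual encoding, which is why it is the one to keep for printing later.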