On Tue, Feb 2, 2010 at 10:11 PM, Kent Johnson <ken...@tds.net> wrote:
> On Tue, Feb 2, 2010 at 1:39 PM, Norman Khine <nor...@khine.net> wrote:
>> On Tue, Feb 2, 2010 at 4:19 PM, Kent Johnson <ken...@tds.net> wrote:
>>> On Tue, Feb 2, 2010 at 9:33 AM, Norman Khine <nor...@khine.net> wrote:
>>>> On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson <ken...@tds.net> wrote:
>>>>> On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine <nor...@khine.net> wrote:
>
>>>>> Why do you use repr() here?
>
>>>
>>> It smells of programming by guess rather than a correct solution to
>>> some problem. What happens if you take it out?
>>
>> when i take it out, i get an empty list.
>>
>> whereas both
>> data = repr( file.read().decode('latin-1') )
>> and
>> data = repr( file.read().decode('utf-8') )
>>
>> returns the full list.
>
> Try this version:
>
> data = file.read()
>
> get_records = re.compile(r"""openInfoWindowHtml\(.*?\ticon: myIcon\n""", re.DOTALL).findall
> get_titles = re.compile(r"""<strong>(.*)<\/strong>""").findall
> get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall
> get_latlngs = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)""").findall
>
> then as before.
>
> Your repr() call is essentially removing newlines from the input by
> converting them to literal '\n' pairs. This allows your regex to work
> without the DOTALL modifier.
>
> Note you will get slightly different results with my version - it will
> give you correct utf-8 text for the titles whereas yours gives \
> escapes. For example one of the titles is "CGTSM (Satére Mawé)". Your
> version returns
>
> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804', '-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'}
>
> Mine gives
> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804', '-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'}
>
> This is showing the repr() of the title so they both have \ but note
> that yours has two \\ indicating that the \ is in the text; mine has
> only one \.
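
The point about repr() and DOTALL quoted above can be checked in isolation. The following is only a sketch in Python 3 (the thread itself uses Python 2); the `sample` string and the shortened `pattern` are made-up stand-ins, since the real page source is not shown in the thread.

```python
import re

# Hypothetical stand-in for one record of the page source (assumption:
# the real input is not shown in the thread).
sample = 'openInfoWindowHtml("<strong>Title</strong>",\n\t{\n\ticon: myIcon\n'

# Shortened, illustrative pattern: from the call name up to the icon line.
pattern = r"openInfoWindowHtml\(.*?icon: myIcon"

# Without DOTALL, "." stops at every newline, so nothing is found.
plain = re.findall(pattern, sample)

# With DOTALL, "." also matches newlines, so the record is found.
dotall = re.findall(pattern, sample, re.DOTALL)

# repr() replaces each real newline with the two characters backslash + "n",
# so "." can run across them even without DOTALL.
escaped = repr(sample)
via_repr = re.findall(pattern, escaped)

print(len(plain), len(dotall), len(via_repr))
```

The side effect of the repr() route is that every other escape in the text (tabs, non-ASCII characters) is also turned into literal backslash sequences, which is where the doubled `\\` in the titles comes from.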
i am no expert, but there seems to be a bigger difference.

with repr(), i get:
Sat\\xe9re Maw\\xe9
whereas you get
Sat\xc3\xa9re Maw\xc3\xa9

repr()'s é == \\xe9 whereas on your version é == \xc3\xa9

>
> Kent
>

also, i still get an empty list when i run the code as suggested.
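
On the \xe9 versus \xc3\xa9 question: both can stand for the same character é. U+00E9 is its code point (and also its single Latin-1 byte), while 0xC3 0xA9 is its two-byte UTF-8 encoding. A small Python 3 sketch, assuming only the title quoted in the thread; `ascii()` is used here as a rough stand-in for what repr() shows for a Python 2 unicode string, and nothing here settles which encoding the original file used.

```python
# The title from the thread; é is the single code point U+00E9.
title = "CGTSM (Sat\u00e9re Maw\u00e9)"

# UTF-8 stores U+00E9 as the two bytes 0xC3 0xA9.
utf8_bytes = title.encode("utf-8")

# Latin-1 stores the same character as the single byte 0xE9.
latin1_bytes = title.encode("latin-1")

# ascii() escapes non-ASCII characters, leaving a literal backslash,
# "x", "e", "9" in the resulting text.
escaped = ascii(title)

print(utf8_bytes)
print(latin1_bytes)
print(escaped)

# Each byte string decodes back to the same text with its matching codec.
same_text = (
    utf8_bytes.decode("utf-8") == title
    and latin1_bytes.decode("latin-1") == title
)
print(same_text)
```

So the two results differ in form rather than in the character they describe: one is encoded bytes, the other is decoded text whose escapes have been baked into the string.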