Hello,

I have this piece of code (http://pastie.org/5366200) which uses
BeautifulSoup to scrape content from a site. The HTML for the example
can be seen here: http://pastie.org/5366172

    short_description = soup.find('div', attrs={"class": "short-description"})
    if short_description:
        short_desc = short_description.find('div', attrs={"class": "std"})
        if short_desc:
            adm_product.append(short_desc.renderContents())

    long_description = soup.find('div', attrs={"class": "box-collateral box-description"})
    if long_description:
        long_desc = long_description.find('div', attrs={"class": "std"})
        if long_desc:
            adm_product.append(long_desc.renderContents())
            L = []
            for tag in long_desc.recursiveChildGenerator():
                if isinstance(tag, BeautifulSoup.Tag):
                    L.append(tag.renderContents())
            desc = " ".join(v for v in L if v > 0)
            print desc
            adm_product.append(desc)
        else:
            adm_product.append('pas du description')

    # we get the country and producer
    for txt in product_shop.findAll(text=True):
        if re.search('Origine', txt, re.I):
            origin = txt.next.strip()
            try:
                country, producer = origin.split(', ')
            except Exception, e:
                pass
            else:
                adm_product.append(country)
                adm_product.append(producer)

When I print the adm_product list, I get:

['002267', 'Barre chocolat au lait fourr\xc3\xa9e \xc3\xa0 la
cr\xc3\xa8me de lait<br />25g, bio et \xc3\xa9quitable<br />Produit
bio contr\xc3\xb4l\xc3\xa9 par Bio Inspecta', '<strong>CHOKICHOC : la
barre de chocolat au lait, fourr&eacute;e &agrave; la cr&egrave;me de
lait</strong> CHOKICHOC : la barre de chocolat au lait, fourr&eacute;e
&agrave; la cr&egrave;me de lait  Exquis m&eacute;lange des plus fins
cacaos et de l&rsquo;aromatique sucre bio du Paraguay, CHOKICHOC est
compos&eacute;e exclusivement de mati&egrave;res premi&egrave;res
cultiv&eacute;es sans additif ni ar&ocirc;me artificiel. Tous les
ingr&eacute;dients proviennent de cultures biologiques.
<strong>L&eacute;g&egrave;re, fondante, id&eacute;ale pour le
go&ucirc;ter, un vrai d&eacute;lice!</strong> L&eacute;g&egrave;re,
fondante, id&eacute;ale pour le go&ucirc;ter, un vrai d&eacute;lice!
La commercialisation des barres CHOKICHOC garantit un prix minimum
pour le producteur, des contrats d&rsquo;achats &agrave; long terme
ainsi que le pr&eacute;financement partiel de la r&eacute;colte.',
'0,90\xc2\xa0',
u'/product/cache/1/image/9df78eab33525d08d6e5fb8d27136e95/0/0/002267_2.jpg',
u'Burkina Faso', u'Cercle des S\xe9cheurs']

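Item [1] of my list is correctly encoded (a UTF-8 byte string with the
accents intact), but item [2] is not: it still contains HTML entities
such as &eacute; and &agrave;. The last two items are not either: they
come out as unicode strings (u'...') instead of byte strings.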
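
To make the difference concrete, here is a minimal sketch that is not
part of the scraper above. It assumes Python 3 and only the standard
library; to_text and samples are hypothetical names, and the sample
values are shortened from the output above. It reduces the three kinds
of value in the list (UTF-8 bytes, text with HTML entities, unicode
text) to one representation:

```python
import html


def to_text(value, encoding="utf-8"):
    """Return value as a text string with HTML entities resolved.

    Hypothetical helper: it assumes byte strings are UTF-8 encoded,
    which matches the bytes shown in the output above.
    """
    if isinstance(value, bytes):
        value = value.decode(encoding)
    return html.unescape(value)


# Shortened stand-ins for items [1], [2] and the last item of the list.
samples = [
    b"fourr\xc3\xa9e \xc3\xa0 la cr\xc3\xa8me",   # UTF-8 encoded bytes
    "fourr&eacute;e &agrave; la cr&egrave;me",    # HTML entities
    "Cercle des S\xe9cheurs",                     # already unicode text
]

normalized = [to_text(s) for s in samples]
print(normalized)
```

After this step the first two samples compare equal, which is the
result I would like to get for the whole adm_product list.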

What am I missing?

Thanks