Hi,
On 24 May 2016 at 04:17, Crusier <crus...@gmail.com> wrote:
>
> Dear All,
>
> I am trying to scrape a web site using Beautiful Soup. However, BS
> doesn't show any of the data. I am just wondering if it is Javascript
> or some other feature which hides all the data.
>
> I have the following questions:
>
> 1) Please advise how to scrape the following data from the website:
>
> 'http://www.dbpower.com.hk/en/quote/quote-warrant/code/10348'
>
> Type, Listing Date (Y-M-D), Call / Put, Last Trading Day (Y-M-D),
> Strike Price, Maturity Date (Y-M-D), Effective Gearing (X),Time to
> Maturity (D),
> Delta (%), Daily Theta (%), Board Lot.......
>
> 2) I am able to scrape most of the data from the same site
>
> 'http://www.dbpower.com.hk/en/quote/quote-cbbc/code/63852'
>
> Please advise what is the difference between these two sites.

You didn't state which version of Python you're using, nor which operating system, but your source calls print with parentheses, so I assume some version of Python 3, and I'm going to guess you're using Windows.

Be that as it may, your program crashes with both Python 2 and Python 3. Python 2 flags the str() conversion as the problem:

Traceback (most recent call last):
  File "test.py", line 30, in <module>
    web_scraper(warrants)
  File "test.py", line 25, in web_scraper
    name1 = str(n.text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 282: ordinal not in range(128)

Meanwhile Python 3 breaks earlier, at the first print, with the message:

Traceback (most recent call last):
  File "test.py", line 30, in <module>
    web_scraper(warrants)
  File "test.py", line 18, in web_scraper
    print(soup)
  File "C:\Python35-32\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 435-439: character maps to <undefined>

Both of these tell you that the failure is an encoding issue.

Aside from this your program seems to work, and the data you say you want to retrieve is in fact returned.

So in short: if you avoid implicitly encoding the Unicode result from Beautiful Soup into ASCII (or the local machine's codepage), which is what happens with your unqualified print calls, you should avoid the problem.

But I guess you're going to want to continue to use print, so you may want to know what the issue is and how to avoid it.

The reason for the problem is (as I understand it) that on Windows your console, which is where the output of print goes, is not Unicode aware. This means that when you ask Python to print a Unicode string to the console, the string must first be converted from Unicode to something your console can accept.

On Python 2, if you don't deal with this explicitly, the "ascii" codec is used, which duly falls over as soon as it meets anything that doesn't map cleanly onto the ASCII character set.

Python 3 is clever enough to figure out what my console codepage is (cp850), which means more characters can be mapped to my console's character set. However, this is still not enough to convert the characters at positions 435-439 of the Beautiful Soup result, as mentioned in the error message.

The way to avoid this is to tell Python how to deal with characters it cannot encode.
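
To show the failure and the workaround in isolation, here is a minimal sketch. The sample string and the to_console() helper are made up for illustration (they are not part of your program), and I'm assuming Python 3 for the output shown by print:

```python
import sys

# Hypothetical sample text. "\xa0" is the non-breaking space that the
# Python 2 traceback complains about.
text = u"Strike Price\xa0: 95.88"

# Encoding with the default "strict" error handler fails, just as the
# implicit conversion to ASCII did in the tracebacks above.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With an explicit error handler the conversion succeeds; characters that
# cannot be encoded are escaped or replaced instead of raising an error.
escaped = text.encode("ascii", "backslashreplace")
replaced = text.encode("ascii", "replace")


def to_console(s, encoding=None):
    """Hypothetical helper: return a copy of s that the given encoding
    (by default the console's) can represent, escaping everything else."""
    enc = encoding or sys.stdout.encoding or "utf-8"
    return s.encode(enc, "backslashreplace").decode(enc)


print(strict_failed)
print(escaped)
print(to_console(text, "ascii"))
```

The round trip through encode() and decode() in to_console() gives back an ordinary string, so on Python 3 print shows the text itself rather than a bytes literal such as b'...'.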

Applied to your program (change the lines marked with ****):

    from bs4 import BeautifulSoup
    import requests
    import json
    import re
    import sys  #****

    warrants = ['10348']

    def web_scraper(warrants):
        url = "http://www.dbpower.com.hk/en/quote/quote-warrant/code/"

        # Scrape from the Web
        for code in warrants:
            new_url = url + code
            response = requests.get(new_url)
            html = response.content
            soup = BeautifulSoup(html, "html.parser")
            # errors= is passed by keyword because the second positional
            # argument of soup.encode() is indent_level, not errors.
            print(soup.encode(sys.stdout.encoding, errors="backslashreplace"))  #****

            name = soup.findAll('div', attrs={'class': 'article_content'})
            #print(name)

            for n in name:
                name1 = n.text  #****
                s_code = name1[:4]
                # Escape anything the console encoding cannot represent.
                print(name1.encode(sys.stdout.encoding, "backslashreplace"))  #****

    web_scraper(warrants)

Here I'm picking up the encoding from stdout, which on my machine is "cp850". If sys.stdout.encoding is blank on your machine, you might try naming an encoding explicitly. As a last resort you might try "utf-8", which should at least make the text "printable" (though perhaps not in the form you want).

I hope that helps. I look forward to possible corrections or improved advice from other list members, as I'm admittedly not an expert on Unicode handling either.

For future reference, always post the full error message, your version of Python and your operating system.

Cheers

Walter
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor