Re: Suitable Python code to scrape specific details from web pages.
On 13/08/2014 7:28 AM, Roy Smith wrote:
> Second, if you're going to be parsing web pages, trying to use regexes is a losing game. You need something that knows how to parse HTML. The canonical answer is lxml (http://lxml.de/), but Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) is less intimidating to use.

lxml also has a BeautifulSoup parser, so you can easily mix and match approaches: http://lxml.de/elementsoup.html

--
https://mail.python.org/mailman/listinfo/python-list
Re: Suitable Python code to scrape specific details from web pages.
On Tue, 12 Aug 2014 13:00:30 -0700, Simon Evans wrote:
> in accessing from the 'Racing Post' on a daily basis. Anyhow, the code

Following is some starter code. You will have to look at the output, compare it to the web page, and work out how you want to process it further.

Note that I use beautifulsoup and requests. The output is the HTML for each cell in the table, with a line of + characters at the table row breaks.

I suggest you look at the beautifulsoup documentation at http://www.crummy.com/software/BeautifulSoup/bs4/doc/ to work out how you may wish to select which table cells contain data you are interested in and how to extract it.

    #!/usr/bin/python
    """Program to extract data from racingpost."""

    from bs4 import BeautifulSoup
    import requests

    r = requests.get("http://www.racingpost.com/horses2/cards/card.sd?race_id=607466&r_date=2014-08-13#raceTabs=sc_")

    if r.status_code == 200:
        # Name the parser explicitly so the result does not depend on
        # which optional parsers happen to be installed.
        soup = BeautifulSoup(r.content, "html.parser")
        table = soup.find("table", id="sc_horseCard")
        for row in table.find_all("tr"):
            for cell in row.find_all("td"):
                print(cell)
            # Marks the end of each table row.
            print("+")
    else:
        print("HTTP Status %s" % r.status_code)

--
Denis McMahon, denismfmcma...@gmail.com
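For working out which cells matter, it can help to practise on a small, saved piece of HTML first. The following is only a sketch under stated assumptions: the table contents, the SAMPLE_HTML string and the TableRows class are all made up for illustration (this is not the Racing Post's real markup), and it uses Python 3's standard-library html.parser rather than Beautiful Soup so that it runs without any extra installs. It collects the text of each cell, grouped by table row.

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely modelled on a race card; NOT the real page.
SAMPLE_HTML = """
<table id="sc_horseCard">
  <tr><td>1</td><td><a href="/horse/1">Lively Baron</a></td><td>9/4F</td></tr>
  <tr><td>2</td><td><a href="/horse/2">Example Runner</a></td><td>5/1</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


parser = TableRows()
parser.feed(SAMPLE_HTML)
parser.close()
for number, name, odds in parser.rows:
    print(number, name, odds)
```

Once the rows are plain lists of strings, picking out the runner name or the odds is ordinary list indexing; on a real page the column positions would have to be checked against the actual source.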
Suitable Python code to scrape specific details from web pages.
Dear Programmers,

I have been looking at the YouTube 'Web Scraping Tutorials' of Chris Reeves. I have tried a few of his Python programs in the Python27 command prompt, but altered them from accessing data using links, say from the Dow Jones index, to accessing the details I would be interested in accessing from the 'Racing Post' on a daily basis. Anyhow, in the example I am going to give, what the code returns is not the information I am seeking: instead of returning the given odds on a horse, it only returns a [], which isn't much use. I would be glad if you could tell me where I am going wrong.

Yours faithfully
Simon Evans.

    import urllib
    import re
    htmlfile = urllib.urlopen("http://www.racingpost.com/horses2/cards/card.sd?race_id=600048&r_date=2014-05-08#raceTabs=sc_")
    htmltext = htmlfile.read()
    regex = 'strong1a href=http://www.racingpost.com/horses/horse_home.sd? horse_id=758752onclick=scorecards.send(quot;horse_namequot:):return Html.popup(this, {width:695,height:800})title=Full details about this HORSELively Baron/a9/4F/strongbr/'
    pattern = re.compile(regex)
    odds = re.findall(pattern, htmltext)
    print odds
    []

    import urllib
    import re
    htmlfile = urllib.urlopen("http://www.racingpost.com/horses2/cards/card.sd?race_id=600048&r_date=2014-05-08#raceTabs=sc_")
    htmltext = htmlfile.read()
    regex = 'a/a'
    pattern = re.compile(regex)
    odds = re.findall(pattern, htmltext)
    print odds
    []
Re: Suitable Python code to scrape specific details from web pages.
On Tue, 12 Aug 2014 13:00:30 -0700 (PDT) Simon Evans <musicalhack...@yahoo.co.uk> wrote:
> Dear Programmers, I have been looking at the YouTube 'Web Scraping Tutorials' of Chris Reeves.
> [snip]
> instead of returning the given odds on a horse, it only returns a [], which isn't much use. I would be glad if you could tell me where I am going wrong.

If you want web scraping, you want to use http://www.crummy.com/software/BeautifulSoup/ . End of story.

--
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order. See above to fix.
Re: Suitable Python code to scrape specific details from web pages.
In article <a8f10c4f-d4a0-48ed-ae92-2a43e9a09...@googlegroups.com>, Simon Evans <musicalhack...@yahoo.co.uk> wrote:
> Dear Programmers, I have been looking at the YouTube 'Web Scraping Tutorials' of Chris Reeves.
> [snip]
> instead of returning the given odds on a horse, it only returns a [], which isn't much use. I would be glad if you could tell me where I am going wrong.

Rather than comment on your specific code (but, thank you for posting it), I'll make a couple of more generic suggestions.

First, if you're doing anything with fetching web pages, install the wonderful requests module (http://docs.python-requests.org/en/latest/). It's so much easier to work with than urllib.

Second, if you're going to be parsing web pages, trying to use regexes is a losing game. You need something that knows how to parse HTML. The canonical answer is lxml (http://lxml.de/), but Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) is less intimidating to use.
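To make the regex point concrete: a pattern copied literally from one version of a page stops matching as soon as the markup changes slightly (an extra attribute, different spacing), whereas a parser is indifferent to such details. The sketch below is only an illustration under stated assumptions: both HTML snippets are invented, LinkText and link_names are made-up helper names, and it uses Python 3's standard-library html.parser as a stand-in for lxml or Beautiful Soup so that it runs with no installs.

```python
import re
from html.parser import HTMLParser

# Two invented snippets carrying the same data with slightly different markup.
PAGE_A = '<strong><a href="/horse/1">Lively Baron</a> 9/4F</strong>'
PAGE_B = '<strong><a title="details"  href="/horse/1" >Lively Baron</a> 9/4F</strong>'

# A literal pattern copied from PAGE_A. re.escape makes any regex
# metacharacters in the copied text match literally.
literal = re.compile(re.escape('<a href="/horse/1">') + r"(.*?)</a>")


class LinkText(HTMLParser):
    """Collect the text inside every <a> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self.names.append("")

    def handle_data(self, data):
        if self._in_link:
            self.names[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False


def link_names(html):
    """Return the text of each link found in the given HTML string."""
    parser = LinkText()
    parser.feed(html)
    parser.close()
    return parser.names


for label, page in (("A", PAGE_A), ("B", PAGE_B)):
    print(label, "regex:", literal.findall(page), "parser:", link_names(page))
```

The literal pattern finds the name only in the first snippet, while the parser-based helper finds it in both. The same reasoning applies with Beautiful Soup or lxml, which handle far messier real-world HTML than this toy class.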
Re: Suitable Python code to scrape specific details from web pages.
On Tuesday, August 12, 2014 9:00:30 PM UTC+1, Simon Evans wrote:
> Dear Programmers, I have been looking at the YouTube 'Web Scraping Tutorials' of Chris Reeves.
> [snip]

Dear Programmers,

Thank you for your responses. I have installed 'Beautiful Soup' and I have the 'Getting Started in Beautiful Soup' book, but can't seem to make any progress with it; I am too thick to make much use of it. I was hoping I could scrape specified stuff off Web pages without using it. I have installed 'Requests' also. Is there any code you can suggest that can access the sort of Web page values that I have referred to, such as odds, names of runners, stuff like that, off the 'inspect element' or 'source' HTML pages on www.Racingpost.com?
Re: Suitable Python code to scrape specific details from web pages.
Simon Evans wrote:
> Dear Programmers, Thank you for your responses. I have installed 'Beautiful Soup' and I have the 'Getting Started in Beautiful Soup' book, but can't seem to make any progress with it, I am too thick to make much use of it. I was hoping I could scrape specified stuff off Web pages without using it.

Yes, you can scrape stuff off web pages without programming. What you do is you open the web page in your browser, then open a notebook and, with a pencil or pen, copy the bits you read into the notebook. If you're very skilled, you can avoid the pencil and paper and type directly into a text editor on the computer.

But other than that, every website is different, so there is no short-cut to web scraping. You need to customize the scraping code for each website you scrape, and that means programming.

Do you know how to program? Are you interested in learning?

If the answer is No and No, then I suggest you pony up some money and pay somebody who already knows how to program to do the job for you.

If the answer is No and Yes, then start at the beginning. Do some programming tutorials, learn to program the basics before moving on to something moderately difficult like web scraping.

If the answer is that you already know how to program, but just don't know how to do web scraping, then stick with it and you'll get there. Web scraping is tricky, but possible, and if you work hard at it you'll succeed.

Unless you're an experienced programmer with all the right skills, don't expect this to be something you do in a few minutes. Depending on your level of experience, you could expect to spend dozens of hours to learn how to scrape a single website. (Fortunately, the second website will probably be a little easier, and the third easier still. By the time you've done a dozen, you'll wonder what the fuss was about.)

By studying how other scraping programs work, and studying how your racing pages store data, you should be able to put the two together and see how to get the data you want.

There's plenty of information to help you learn how to web scrape, with or without BeautifulSoup:

https://startpage.com/do/search/?q=beautifulsoup+web+scraping
https://ixquick.com/do/search/?q=python+web+scraping+examples
https://duckduckgo.com/html/?q=requests%20python%20web%20scraping%20example

but no alternative to actually writing code.

> I have installed 'Requests' also, is there any code I can use that you can suggest that can access the sort of Web page values that I have referred to ? such as odds, names of runners, stuff like that off the 'inspect element' or 'source' HTML pages, on www.Racingpost.com.

Specifically those pages? Doubtful. If you are really lucky (1) somebody else has already done the programming, (2) they've made their program available to others, and (3) you can find that program on the Internet. Use the search engine of your choice to search for it.

--
Steven
Re: Suitable Python code to scrape specific details from web pages.
In article <53eaab7d$0$29979$c3e8da3$54964...@news.astraweb.com>, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> By studying how other scraping programs work, and studying how your racing pages store data, you should be able to put the two together and see how to get the data you want.

It's also worth mentioning that some web sites *want* you to have their data, and make it easy to do so by exposing it via public APIs or other download methods. Wikipedia. Many government web sites. Twitter. Facebook. Reddit.

Whenever you start thinking about web scraping, it's always worth spending a little time investigating if such an API exists. If it does, that's where you want to go. If not, well, there's always Beautiful Soup :-)
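As a rough illustration of the API route, the sketch below builds a query URL for Wikipedia's MediaWiki web API and reads a page title out of a JSON reply. To keep it runnable offline, the reply is a hand-written sample in the general shape that API returns, not real fetched data; the helper names build_query_url and page_titles are made up for this example, and the page id is invented.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Return a MediaWiki 'query' URL asking for basic info on one page."""
    params = {"action": "query", "format": "json", "prop": "info", "titles": title}
    return API_ENDPOINT + "?" + urlencode(params)


def page_titles(reply_text):
    """Pull page titles out of a JSON reply shaped like the sample below."""
    reply = json.loads(reply_text)
    pages = reply.get("query", {}).get("pages", {})
    return sorted(page["title"] for page in pages.values())


# Hand-written sample reply (assumed shape, invented page id); not fetched.
SAMPLE_REPLY = (
    '{"query": {"pages": {"12345": '
    '{"pageid": 12345, "ns": 0, "title": "Horse racing"}}}}'
)

print(build_query_url("Horse racing"))
print(page_titles(SAMPLE_REPLY))
```

In real use the URL would be fetched with requests or urllib, and the site's API documentation and terms of use should be checked first. The point is that the data arrives already structured, so no HTML parsing is needed at all.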
Re: Suitable Python code to scrape specific details from web pages.
On Tue, 12 Aug 2014 15:44:58 -0700 (PDT), Simon Evans wrote:
> [snip]
> Dear Programmers, Thank you for your responses. I have installed 'Beautiful Soup' and I have the 'Getting Started in Beautiful Soup' book, but can't seem to make any progress with it, I am too thick to make much use of it. I was hoping I could scrape specified stuff off Web pages without using it.

I've only used BeautifulSoup a little bit, and am no expert, but with it one can do wonderfully complex things with simple code. Perhaps you can find some examples online; this newsgroup sometimes has awesome demonstrations of BS prowess.

At the risk of embarrassing myself in public, I'll show you some code I wrote that scrapes data from a web page containing a description of a drug. The drug's web page contains the desired data in tags that look like this:

    <input id="form-widgets-minconcentration" name="form.widgets.minconcentration" class="text-widget float-field" value="1.0" type="text" />

The following code finds all these tags and builds a dict by which you can look up the value for any given name.

    import urllib2
    from BeautifulSoup import BeautifulSoup as BS
    ...
    def dump_drug_data(url):
        """Fetch data from one drug's URL and print selected fields in columns."""
        contents = urllib2.urlopen(url=url).read()
        soup = BS(contents)
        inputs = soup.findAll("input")
        # Map each input's name attribute to its value attribute.
        input_dict = dict((i.get("name"), i.get("value")) for i in inputs)
        # Print the selected fields as fixed-width columns on one line.
        print(" ".join(f.format(input_dict[n]) for f, n in (
            ("{0:5s}", "form.widgets.absorption_halflife"),
            ("{0:5s}", "form.widgets.elimination_halflife"),
            ("{0:5s}", "form.widgets.minconcentration"),
            ("{0:5s}", "form.widgets.maxconcentration"),
            ("{0:13s}", "form.widgets.title"),
            )))

Try giving a more specific picture of your quest, and it's very likely that people smarter than me will give you good help.

--
To email me, substitute nowhere->spamcop, invalid->net.
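The same name-to-value idea can be tried on a saved snippet without fetching anything. This is a sketch only, under stated assumptions: it uses Python 3's standard-library html.parser instead of BeautifulSoup, InputValues is a made-up class name, and apart from the minconcentration tag quoted in the post, the sample markup and its values are invented.

```python
from html.parser import HTMLParser

# First tag is the one quoted in the post; the second is invented for the demo.
SAMPLE_HTML = """
<input id="form-widgets-minconcentration" name="form.widgets.minconcentration"
       class="text-widget float-field" value="1.0" type="text" />
<input id="form-widgets-maxconcentration" name="form.widgets.maxconcentration"
       class="text-widget float-field" value="4.5" type="text" />
"""


class InputValues(HTMLParser):
    """Build a dict mapping each <input>'s name attribute to its value."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag == "input":
            attributes = dict(attrs)
            if "name" in attributes:
                self.values[attributes["name"]] = attributes.get("value")


parser = InputValues()
parser.feed(SAMPLE_HTML)
parser.close()

# Print two of the fields as fixed-width columns, as in the post's code.
print("{0:5s} {1:5s}".format(
    parser.values["form.widgets.minconcentration"],
    parser.values["form.widgets.maxconcentration"],
))
```

Building the dict first and formatting afterwards keeps the scraping step separate from the presentation step, which makes it easier to check what was actually found on the page.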