Re: Suitable Python code to scrape specific details from web pages.

2014-08-17 Thread alex23

On 13/08/2014 7:28 AM, Roy Smith wrote:

> Second, if you're going to be parsing web pages, trying to use regexes
> is a losing game.  You need something that knows how to parse HTML.  The
> canonical answer is lxml (http://lxml.de/), but Beautiful Soup
> (http://www.crummy.com/software/BeautifulSoup/) is less intimidating to
> use.


lxml also has a BeautifulSoup parser, so you can easily mix and match 
approaches:


http://lxml.de/elementsoup.html

--
https://mail.python.org/mailman/listinfo/python-list


Re: Suitable Python code to scrape specific details from web pages.

2014-08-13 Thread Denis McMahon
On Tue, 12 Aug 2014 13:00:30 -0700, Simon Evans wrote:

> in accessing from the 'Racing Post' on a daily basis. Anyhow, the code

Following is some starter code. You will have to look at the output, 
compare it to the web page, and work out how you want to process it 
further. Note that I use beautifulsoup and requests. The output is the 
html for each cell in the table with a line of + characters at the 
table row breaks. I suggest you look at the beautifulsoup documentation 
at http://www.crummy.com/software/BeautifulSoup/bs4/doc/ to work out how 
you may wish to select which table cells contain data you are interested 
in and how to extract it.

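#!/usr/bin/python

"""
Program to extract data from racingpost.
"""

from bs4 import BeautifulSoup
import requests

# The race id and date in the query string identify one race card.
r = requests.get("http://www.racingpost.com/horses2/cards/card.sd?"
                 "race_id=607466&r_date=2014-08-13#raceTabs=sc_")

if r.status_code == 200:
    # Name the parser explicitly so every installation parses the same way.
    soup = BeautifulSoup(r.content, "html.parser")
    table = soup.find("table", id="sc_horseCard")
    # Print the html of each cell, with a line of + characters between rows.
    for row in table.find_all("tr"):
        for cell in row.find_all("td"):
            print(cell)
        print("+" * 40)
else:
    print("HTTP Status %s" % r.status_code)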

-- 
Denis McMahon, denismfmcma...@gmail.com


Suitable Python code to scrape specific details from web pages.

2014-08-12 Thread Simon Evans
Dear Programmers,
I have been looking at the YouTube 'Web Scraping Tutorials' by Chris Reeves. I 
have tried a few of his Python programs at the Python27 command prompt, but 
altered them from accessing data using links from, say, the Dow Jones index to 
accessing the details I am interested in from the 'Racing Post' on a daily 
basis. However, what the code returns in the example I am going to give is not 
the information I am seeking: instead of returning the given odds on a horse, 
it only returns a [], which isn't much use. 
I would be glad if you could tell me where I am going wrong. 
Yours faithfully
Simon Evans.

import urllib
import re
htmlfile = urllib.urlopen("http://www.racingpost.com/horses2/cards/card.sd?race_id=600048&r_date=2014-05-08#raceTabs=sc_")
htmltext = htmlfile.read()
regex = '<strong>1<a href="http://www.racingpost.com/horses/horse_home.sd?horse_id=758752"onclick="scorecards.send(&quot;horse_name&quot:):return Html.popup(this,{width:695,height:800})"title="Full details about this HORSE">Lively Baron</a>9/4F</strong><br/>'
pattern = re.compile(regex)
odds=re.findall(pattern,htmltext)
print odds
[]


import urllib
import re
htmlfile = urllib.urlopen("http://www.racingpost.com/horses2/cards/card.sd?race_id=600048&r_date=2014-05-08#raceTabs=sc_")
htmltext = htmlfile.read()
regex = '<a></a>'
pattern = re.compile(regex)
odds=re.findall(pattern,htmltext)
print odds
[]



Re: Suitable Python code to scrape specific details from web pages.

2014-08-12 Thread Rob Gaddi
On Tue, 12 Aug 2014 13:00:30 -0700 (PDT)
Simon Evans <musicalhack...@yahoo.co.uk> wrote:

[snip]

If you want web scraping, you want to use
http://www.crummy.com/software/BeautifulSoup/ .  End of story.

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order.  See above to fix.


Re: Suitable Python code to scrape specific details from web pages.

2014-08-12 Thread Roy Smith
In article <a8f10c4f-d4a0-48ed-ae92-2a43e9a09...@googlegroups.com>,
 Simon Evans <musicalhack...@yahoo.co.uk> wrote:

> Dear Programmers,
> I have been looking at the You tube 'Web Scraping Tutorials' of Chris Reeves. 
> I have tried a few of his python programs in the Python27 command prompt, but 
> altered them from accessing data using links say from the Dow Jones index, to 
> accessing the details I would be interested in accessing from the 'Racing 
> Post' on a daily basis. Anyhow, the code it returns is not in the example I 
> am going to give, is not the information I am seeking, instead of returning 
> the given odds on a horse, it only returns a [], which isn't much use. 
> I would be glad if you could tell me where I am going wrong. 

Rather than comment on your specific code (but, thank you for posting 
it), I'll make a couple of more generic suggestions.

First, if you're doing anything with fetching web pages, install the 
wonderful requests module (http://docs.python-requests.org/en/latest/).  
It's so much easier to work with than urllib.

Second, if you're going to be parsing web pages, trying to use regexes 
is a losing game.  You need something that knows how to parse HTML.  The 
canonical answer is lxml (http://lxml.de/), but Beautiful Soup 
(http://www.crummy.com/software/BeautifulSoup/) is less intimidating to 
use.
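
A minimal sketch of one way the regex approach goes wrong, assuming Python 3 and a made-up page fragment (the URL and markup below are hypothetical stand-ins, not the real Racing Post source). Markup pasted verbatim into a pattern is full of regex metacharacters such as ? and ( and {, so the pattern may not even match the text it was copied from, which is one plausible contributor to an empty list:

```python
import re

# Hypothetical stand-in for a fragment of the fetched page source.
html = '<a href="http://example.invalid/card.sd?race_id=1">Lively Baron</a>9/4F<br/>'

# Pasted verbatim as a pattern: "?" makes the preceding "d" optional
# instead of matching a literal question mark, so nothing matches.
literal = 'card.sd?race_id=1'
print(re.findall(literal, html))             # []

# re.escape() neutralises the metacharacters, so the text matches itself.
print(re.findall(re.escape(literal), html))  # ['card.sd?race_id=1']
```

Even a correctly escaped pattern still breaks as soon as the site changes an attribute or reorders its markup, which is the stronger argument for a real HTML parser.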


Re: Suitable Python code to scrape specific details from web pages.

2014-08-12 Thread Simon Evans
On Tuesday, August 12, 2014 9:00:30 PM UTC+1, Simon Evans wrote:
[snip]

Dear Programmers, Thank you for your responses. I have installed 'Beautiful 
Soup' and I have the 'Getting Started in Beautiful Soup' book, but can't seem 
to make any progress with it; I am too thick to make much use of it. I was 
hoping I could scrape specified stuff off Web pages without using it. I have 
installed 'Requests' also. Is there any code you can suggest that can access 
the sort of Web page values I have referred to, such as odds, names of 
runners, stuff like that, from the 'inspect element' or 'source' HTML pages 
on www.Racingpost.com? 


Re: Suitable Python code to scrape specific details from web pages.

2014-08-12 Thread Steven D'Aprano
Simon Evans wrote:

> Dear Programmers, Thank you for your responses. I have installed
> 'Beautiful Soup' and I have the 'Getting Started in Beautiful Soup' book,
> but can't seem to make  any progress with it, I am too thick to make much
> use of it. I was hoping I could scrape specified stuff off Web pages
> without using it.

Yes, you can scrape stuff off web pages without programming. What you do is
you open the web page in your browser, then open a notebook and, with a
pencil or pen, copy the bits you read into the notebook.

If you're very skilled, you can avoid the pencil and paper and type directly
into a text editor on the computer.

But other than that, every website is different, so there is no short-cut to
web scraping. You need to customize the scraping code for each website you
scrape, and that means programming. Do you know how to program? Are you
interested in learning? If the answer is No and No, then I suggest you
pony up some money and pay somebody who already knows how to program to do
the job for you.

If the answer is No and Yes, then start at the beginning. Do some
programming tutorials, learn to program the basics before moving on to
something moderately difficult like web scraping.

If the answer is that you already know how to program, but just don't know
how to do web scraping, then stick with it and you'll get there. Web
scraping is tricky, but possible, and if you work hard at it you'll
succeed. Unless you're an experienced programmer with all the right skills,
don't expect this to be something you do in a few minutes. Depending on
your level of experience, you could expect to spend dozens of hours to
learn how to scrape a single website. (Fortunately, the second website will
probably be a little easier, and the third easier still. By the time you've
done a dozen, you'll wonder what the fuss was about.) 

By studying how other scraping programs work, and studying how your racing
pages store data, you should be able to put the two together and see how to
get the data you want. There's plenty of information to help you learn how
to web scrape, with or without BeautifulSoup:

https://startpage.com/do/search/?q=beautifulsoup+web+scraping

https://ixquick.com/do/search/?q=python+web+scraping+examples

https://duckduckgo.com/html/?q=requests%20python%20web%20scraping%20example

but no alternative to actually writing code.
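
As a rough illustration of the "without BeautifulSoup" route, here is a sketch that uses only the Python 3 standard library's html.parser module. The TableCells class and the sample markup are invented for this example; the real race card will be laid out differently, so treat this as a starting shape rather than working scraper code for that site:

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text only counts while we are inside a <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up markup standing in for a race card; the real page will differ.
sample = """
<table id="card">
  <tr><td>1</td><td><a href="#">Lively Baron</a></td><td>9/4F</td></tr>
  <tr><td>2</td><td><a href="#">Another Runner</a></td><td>5/1</td></tr>
</table>
"""

parser = TableCells()
parser.feed(sample)
for row in parser.rows:
    print(row)
```

Each row comes back as a plain list of strings, so picking out the runner's name or the odds is ordinary list indexing once you know which column holds what.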


> I have installed 'Requests' also, is there any code I 
> can use that you can suggest that can access the sort of Web page values
> that I have referred to ?  such as odds, names of runners, stuff like that
> off the 'inspect element' or 'source' HTML pages, on www.Racingpost.com.

Specifically those pages? Doubtful.

If you are really lucky (1) somebody else has already done the programming,
(2) they've made their program available to others, and (3) you can find
that program on the Internet. Use the search engine of your choice to
search for it.



-- 
Steven



Re: Suitable Python code to scrape specific details from web pages.

2014-08-12 Thread Roy Smith
In article <53eaab7d$0$29979$c3e8da3$54964...@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:

> By studying how other scraping programs work, and studying how your racing
> pages store data, you should be able to put the two together and see how to
> get the data you want.

It's also worth mentioning that some web sites *want* you to have their 
data, and make it easy to do so by exposing it via public APIs or other 
download methods.  Wikipedia.  Many government web sites.  Twitter.  
Facebook.  Reddit.

Whenever you start thinking about web scraping, it's always worth 
spending a little time investigating if such an API exists.  If it does, 
that's where you want to go.  If not, well, there's always Beautiful 
Soup :-)
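
To give a feel for the difference, here is a hedged sketch using Wikipedia's MediaWiki query API as the example. The helper names build_query_url and page_titles are made up for illustration, and the sample payload is hand-written to resemble a query response rather than fetched from the site, so check the API's own documentation before relying on the exact shape:

```python
import json
from urllib.parse import urlencode


def build_query_url(title):
    """Build a MediaWiki API URL asking for basic info about one page."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "info",
        "titles": title,
    }
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)


def page_titles(payload):
    """Pull the page titles out of a JSON query response."""
    pages = json.loads(payload)["query"]["pages"]
    return sorted(page["title"] for page in pages.values())


# Hand-written sample shaped like a query response; not real fetched data.
sample = '{"query": {"pages": {"123": {"pageid": 123, "title": "Horse racing"}}}}'

print(build_query_url("Horse racing"))
print(page_titles(sample))
```

The data arrives already structured, so there is no markup to pick apart and nothing to break when the site's page layout changes.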


Re: Suitable Python code to scrape specific details from web pages.

2014-08-12 Thread Peter Pearson
On Tue, 12 Aug 2014 15:44:58 -0700 (PDT), Simon Evans wrote:
[snip]
> Dear Programmers, Thank you for your responses. I have installed
> 'Beautiful Soup' and I have the 'Getting Started in Beautiful Soup'
> book, but can't seem to make any progress with it, I am too thick to
> make much use of it. I was hoping I could scrape specified stuff off
> Web pages without using it.

I've only used BeautifulSoup a little bit, and am no expert, but
with it one can do wonderfully complex things with simple code.
Perhaps you can find some examples online; this newsgroup sometimes
has awesome demonstrations of BS prowess.

At the risk of embarrassing myself in public, I'll show you some
code I wrote that scrapes data from a web page containing a
description of a drug.  The drug's web page contains the desired
data in tags that look like this:

<input id="form-widgets-minconcentration" name="form.widgets.minconcentration"
class="text-widget float-field" value="1.0" type="text" />

The following code finds all these tags and builds a dict in which you
can look up the value for any given name.

import urllib2
from BeautifulSoup import BeautifulSoup as BS
...

def dump_drug_data(url):
    """Fetch data from one drug's URL and print selected fields in columns."""

    contents = urllib2.urlopen(url=url).read()
    soup = BS(contents)
    # Map each input tag's name attribute to its value attribute.
    inputs = soup.findAll("input")
    input_dict = dict((i.get("name"), i.get("value")) for i in inputs)
    # Print the chosen fields side by side, each padded to a fixed width.
    print(" ".join(f.format(input_dict[n]) for f, n in (
        ("{0:5s}", "form.widgets.absorption_halflife"),
        ("{0:5s}", "form.widgets.elimination_halflife"),
        ("{0:5s}", "form.widgets.minconcentration"),
        ("{0:5s}", "form.widgets.maxconcentration"),
        ("{0:13s}", "form.widgets.title"),
        )))

Try giving a more specific picture of your quest, and it's very
likely that people smarter than me will give you good help.
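
For anyone without the old BeautifulSoup 3 package, the same name-to-value idea can be sketched with only the Python 3 standard library. The InputValues class is a made-up name for this example, and the sample string is the input tag shown above rather than a fetched page:

```python
from html.parser import HTMLParser


class InputValues(HTMLParser):
    """Map each <input> tag's name attribute to its value attribute."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (attribute, value) pairs.
        if tag == "input":
            attrs = dict(attrs)
            if "name" in attrs:
                self.values[attrs["name"]] = attrs.get("value")


# The example input tag from the message above.
sample = ('<input id="form-widgets-minconcentration" '
          'name="form.widgets.minconcentration" '
          'class="text-widget float-field" value="1.0" type="text" />')

parser = InputValues()
parser.feed(sample)
print(parser.values)
```

Looking up parser.values["form.widgets.minconcentration"] then gives the string "1.0", ready to be formatted or converted to a number.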

-- 
To email me, substitute nowhere->spamcop, invalid->net.