Re: Screen scraper to get all 'a title' elements

2015-11-26 Thread Denis McMahon
On Wed, 25 Nov 2015 12:42:00 -0800, ryguy7272 wrote:

> Hello experts.  I'm looking at this url:
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
> 
> I'm trying to figure out how to list all 'a title' elements.

'a' is the element tag; 'title' is an attribute of the HTMLAnchorElement.

Combining bs4 with Python structures allows you to find all the specified 
attributes of an element type, for example to find the class attributes 
of all the paragraphs with a class attribute:

stuff = [p.attrs['class'] for p in soup.find_all('p') if 'class' in p.attrs]

Then you can do this:

for thing in stuff:
    print thing

(Python 2.7)

This may be adaptable to your requirement.
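
As a rough sketch of that adaptation (Python 3 syntax), the same "collect one attribute from every tag of one type" idea can be written with the standard library's html.parser, swapped in here for bs4 so that nothing has to be installed. The AttrCollector class, the collect_attr helper and the sample markup are all made up for illustration:

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: collect one attribute from every tag of one type."""

    def __init__(self, tag, attr):
        super().__init__()
        self.tag = tag
        self.attr = attr
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lower-cased.
        if tag == self.tag:
            for name, value in attrs:
                if name == self.attr and value is not None:
                    self.values.append(value)


def collect_attr(html, tag, attr):
    """Return the values of `attr` for every `tag` element that has it."""
    parser = AttrCollector(tag, attr)
    parser.feed(html)
    parser.close()
    return parser.values


# Invented sample markup, loosely modelled on the links on the page.
sample = (
    '<p><a href="/wiki/Accident,_Maryland" title="Accident, Maryland">Accident</a> '
    '<a href="/wiki/Alert,_Nunavut" title="Alert, Nunavut">Alert</a> '
    '<a href="#top">no title here</a></p>'
)

print(collect_attr(sample, "a", "title"))  # ['Accident, Maryland', 'Alert, Nunavut']
```

The anchor without a title attribute is simply skipped, which mirrors the "if 'class' in p.attrs" filter above.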

-- 
Denis McMahon, denismfmcma...@gmail.com
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Grobu

Hi

It seems that links on that Wikipedia page follow the structure :


You could extract a list of link titles with something like :
re.findall( r'\<a[^>]+title="(.+?)"', html )

HTH,

-Grobu-


On 25/11/15 21:55, MRAB wrote:

On 2015-11-25 20:42, ryguy7272 wrote:

Hello experts.  I'm looking at this url:
https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names

I'm trying to figure out how to list all 'a title' elements.  For
instance, I see the following:
Accident
Ala-Lemu
Alert
Apocalypse
Peaks

So, I tried putting a script together to get 'title'.  Here's my attempt.

import requests
import sys
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll('title'):
 print(link)

All that does is get the title of the page.  I tried to get the links
from that url, with this script.


A 'title' element has the form "<title ...>". What you should be looking
for are 'a' elements, those of the form "<a ...>".


import urllib2
import re

#connect to a URL
website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')


#read html code
html = website.read()

#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)

print links

That doesn't work either.  Basically, I'd like to see this.

Accident
Ala-Lemu
Alert
Apocalypse Peaks
Athol
Å
Barbecue
Båstad
Bastardstown
Batman
Bathmen (Battem), Netherlands
...
Worms
Yell
Zigzag
Zzyzx

How can I do that?
Thanks all!!




Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Chris Angelico
On Thu, Nov 26, 2015 at 10:37 AM, ryguy7272  wrote:
> Wow!  Awesome!  I bookmarked that link!
> Thanks for everything!!!

Also bookmark this link:

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

And read it before you do any parsing of HTML using regular expressions.

ChrisA


Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Grobu

On 26/11/15 00:06, Chris Angelico wrote:

On Thu, Nov 26, 2015 at 9:48 AM, ryguy7272  wrote:

Thanks!!  Is that regex?  Can you explain exactly what it is doing?
Also, it seems to pick up a lot more than just the list I wanted, but that's 
ok, I can see why it does that.

Can you just please explain what it's doing???


It's a trap!

Don't use a regex to parse HTML, unless you're deliberately trying to
entice young and innocent programmers to the dark side.

ChrisA



Sorry, I wasn't aware of regex being on the dark side :-)
Now that you mention it, I suppose that their being complex and 
error-inducing could lead to broken code all too easily when there is a 
reliable, ready-made solution like BeautifulSoup.




Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Chris Angelico
On Thu, Nov 26, 2015 at 10:44 AM, Grobu  wrote:
> On 26/11/15 00:06, Chris Angelico wrote:
>>
>> On Thu, Nov 26, 2015 at 9:48 AM, ryguy7272  wrote:
>>>
>>> Thanks!!  Is that regex?  Can you explain exactly what it is doing?
>>> Also, it seems to pick up a lot more than just the list I wanted, but
>>> that's ok, I can see why it does that.
>>>
>>> Can you just please explain what it's doing???
>>
>>
>> It's a trap!
>>
>> Don't use a regex to parse HTML, unless you're deliberately trying to
>> entice young and innocent programmers to the dark side.
>>
>> ChrisA
>>
>
> Sorry, I wasn't aware of regex being on the dark side :-)
> Now that you mention it, I suppose that their being complex and
> error-inducing could lead to broken code all too easily when there is a
> reliable, ready-made solution like BeautifulSoup.

Regular expressions have their uses, but parsing HTML is not one of
them. The most important use of a regex is letting an end user control
the search pattern; it's a compact language for describing a variety
of text search concepts. For hard-coded regular expressions, there are
some places where they're very good, and a lot of places where they're
the wrong tool for the job. And one of those wrong-tool-for-job places
is parsing stuff that fundamentally cannot be parsed with regexes,
such as HTML. You _need_ a proper parser, which is what Beautiful Soup
is for.
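
To make that concrete, here is a small sketch (Python 3 syntax, standard library only) comparing a title-extracting regex with a real parser on three legal variations of the same markup. The sample HTML and the TitleCollector class are invented for illustration, and the standard library's html.parser stands in for Beautiful Soup purely so that nothing needs installing:

```python
import re
from html.parser import HTMLParser

# Invented sample markup: two real links and one commented-out link.
html = (
    "<a href='/wiki/Batman' title='Batman, Turkey'>Batman</a>"   # single quotes
    '<A HREF="/wiki/Zzyzx" TITLE="Zzyzx, California">Zzyzx</A>'  # upper case
    '<!-- <a href="/old" title="Commented out">old</a> -->'      # not a real link
)

# The regex approach: misses both real links and reports the comment.
regex_titles = re.findall(r'<a[^>]+title="(.+?)"', html)


class TitleCollector(HTMLParser):
    """Hypothetical helper: gather title attributes from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names and strips the quotes.
        if tag == "a":
            title = dict(attrs).get("title")
            if title is not None:
                self.titles.append(title)


collector = TitleCollector()
collector.feed(html)
collector.close()
parser_titles = collector.titles

print(regex_titles)   # ['Commented out']
print(parser_titles)  # ['Batman, Turkey', 'Zzyzx, California']
```

The regex could be patched for each of these cases in turn, but every patch makes it longer and there is always another variation; the parser already knows the rules.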

ChrisA


Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Chris Angelico
On Thu, Nov 26, 2015 at 9:48 AM, ryguy7272  wrote:
> Thanks!!  Is that regex?  Can you explain exactly what it is doing?
> Also, it seems to pick up a lot more than just the list I wanted, but that's 
> ok, I can see why it does that.
>
> Can you just please explain what it's doing???

It's a trap!

Don't use a regex to parse HTML, unless you're deliberately trying to
entice young and innocent programmers to the dark side.

ChrisA


Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Grobu


On 25/11/15 23:48, ryguy7272 wrote:

re.findall( r'\<a[^>]+title="(.+?)"', html )

[ ... ]

Thanks!!  Is that regex?  Can you explain exactly what it is doing?
Also, it seems to pick up a lot more than just the list I wanted, but that's 
ok, I can see why it does that.

Can you just please explain what it's doing???



Yes it's a regular expression. Because regexes use the backslash as an 
escape character, it is advisable to use the "raw string" prefix (r 
before the single/double/triple quote). To illustrate it with an example :

>>> print "1\n2"
1
2
>>> print r"1\n2"
1\n2
As the backslash escape character is "neutralized" by the raw string, 
you can use the usual RegEx syntax at leisure :


\<a[^>]+title="(.+?)"

\<   was a mistake on my part, a single < is perfectly enough
[^>]	is a class definition, and the caret (^) character indicates 
negation. Thus it means : any character other than >

+   indicates repetition : one or more of the previous element
.   will match just anything
.+"	is a _greedy_ pattern that would match anything until it encountered 
a double quote


The problem with a greedy pattern is that it doesn't stop at the first 
match. To illustrate :

>>> a = re.search( r'".+"', 'title="this is a test" class="test"' )
>>> a.group()
'"this is a test" class="test"'

It matches the first quote up to the last one.
On the other hand, you can use the "?" modifier to specify a non-greedy 
pattern :


>>> b = re.search( r'".+?"', 'title="this is a test" class="test"' )
>>> b.group()
'"this is a test"'

It matches the first quote and stops looking for further matches after 
the second quote.


Finally, the parentheses are used to indicate a capture group :
>>> a = re.search( r'"this (is) a (.+?)"', 'title="this is a test" class="test"' )

>>> a.groups()
('is', 'test')
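
Putting the pieces together, here is a minimal sketch (Python 3 syntax) of the full pattern, written with a single <, run against a short invented HTML string in both its non-greedy and greedy forms. The sample markup and the variable names lazy and greedy are made up for illustration:

```python
import re

# Invented sample: two anchors on one line, each with a title attribute.
html = (
    '<a href="/wiki/Accident,_Maryland" title="Accident, Maryland">Accident</a>'
    '<a class="new" title="Ala-Lemu (page does not exist)">Ala-Lemu</a>'
)

# Non-greedy capture: each match stops at the first closing quote.
lazy = re.findall(r'<a[^>]+title="(.+?)"', html)

# Greedy capture: .+ runs on to the last double quote it can reach.
greedy = re.findall(r'<a[^>]+title="(.+)"', html)

print(lazy)    # ['Accident, Maryland', 'Ala-Lemu (page does not exist)']
print(greedy)  # one long string that swallows the markup between the two titles
```

With the greedy form the two titles collapse into a single match, which is why the ? after .+ matters here.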


You can find detailed explanations about Python regular expressions at 
this page : https://docs.python.org/2/howto/regex.html


HTH,

-Grobu-



Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Marko Rauhamaa
Grobu :

> Sorry, I wasn't aware of regex being on the dark side :-)

No, regular expressions are great for many purposes. Parsing
context-free syntax isn't one of them.

See:

  https://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy

Most modern programming languages including HTML are context-free. Their
structure is too rich for regular expressions to capture.

Regular expressions can handle any regular language just fine. They are
commonly used to define the lexical tokens of a language.
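
As a rough illustration of that split (Python 3 syntax), the sketch below uses one regular expression to recognize the tokens of a toy arithmetic language, and a plain counter to check parenthesis nesting, which is the part a regular expression cannot express. The token names and the tokenize and balanced helpers are invented for this example:

```python
import re

# One alternative per token kind; the named group tells us which one matched.
TOKEN = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<OP>[-+*/()])
  | (?P<SKIP>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Yield (kind, value) pairs; raise ValueError on an unknown character."""
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("unexpected character %r at %d" % (text[pos], pos))
        pos = match.end()
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()


def balanced(tokens):
    """Nesting is the context-free part: it needs a counter, not a regex."""
    depth = 0
    for kind, value in tokens:
        if value == "(":
            depth += 1
        elif value == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


print(list(tokenize("2 * (x + 10)")))
print(balanced(tokenize("((a)")))  # False
```

The regex does the lexical work and the surrounding code does the structural work, which is the same division of labour a parser generator uses.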


Marko


Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread ryguy7272
On Wednesday, November 25, 2015 at 5:30:14 PM UTC-5, Grobu wrote:
> Hi
> 
> It seems that links on that Wikipedia page follow the structure :
> 
> 
> You could extract a list of link titles with something like :
> re.findall( r'\<a[^>]+title="(.+?)"', html )
> 
> HTH,
> 
> -Grobu-
> 
> 
> On 25/11/15 21:55, MRAB wrote:
> > On 2015-11-25 20:42, ryguy7272 wrote:
> >> Hello experts.  I'm looking at this url:
> >> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
> >>
> >> I'm trying to figure out how to list all 'a title' elements.  For
> >> instance, I see the following:
> >>  >> href="/wiki/Accident,_Maryland">Accident
> >>  >> href="/w/index.php?title=Ala-Lemu=edit=1">Ala-Lemu
> >> Alert
> >> Apocalypse
> >> Peaks
> >>
> >> So, I tried putting a script together to get 'title'.  Here's my attempt.
> >>
> >> import requests
> >> import sys
> >> from bs4 import BeautifulSoup
> >>
> >> url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
> >> source_code = requests.get(url)
> >> plain_text = source_code.text
> >> soup = BeautifulSoup(plain_text)
> >> for link in soup.findAll('title'):
> >>  print(link)
> >>
> >> All that does is get the title of the page.  I tried to get the links
> >> from that url, with this script.
> >>
> > A 'title' element has the form "<title ...>". What you should be looking
> > for are 'a' elements, those of the form "<a ...>".
> >
> >> import urllib2
> >> import re
> >>
> >> #connect to a URL
> >> website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')
> >>
> >>
> >> #read html code
> >> html = website.read()
> >>
> >> #use re.findall to get all the links
> >> links = re.findall('"((http|ftp)s?://.*?)"', html)
> >>
> >> print links
> >>
> >> That doesn't work either.  Basically, I'd like to see this.
> >>
> >> Accident
> >> Ala-Lemu
> >> Alert
> >> Apocalypse Peaks
> >> Athol
> >> Å
> >> Barbecue
> >> Båstad
> >> Bastardstown
> >> Batman
> >> Bathmen (Battem), Netherlands
> >> ...
> >> Worms
> >> Yell
> >> Zigzag
> >> Zzyzx
> >>
> >> How can I do that?
> >> Thanks all!!



Thanks!!  Is that regex?  Can you explain exactly what it is doing?
Also, it seems to pick up a lot more than just the list I wanted, but that's 
ok, I can see why it does that.  

Can you just please explain what it's doing???


Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread ryguy7272
On Wednesday, November 25, 2015 at 6:34:00 PM UTC-5, Grobu wrote:
> On 25/11/15 23:48, ryguy7272 wrote:
> >> re.findall( r'\<a[^>]+title="(.+?)"', html )
> [ ... ]
> > Thanks!!  Is that regex?  Can you explain exactly what it is doing?
> > Also, it seems to pick up a lot more than just the list I wanted, but 
> > that's ok, I can see why it does that.
> >
> > Can you just please explain what it's doing???
> >
> 
> Yes it's a regular expression. Because regexes use the backslash as an 
> escape character, it is advisable to use the "raw string" prefix (r 
> before the single/double/triple quote). To illustrate it with an example :
>   >>> print "1\n2"
>   1
>   2
>   >>> print r"1\n2"
>   1\n2
> As the backslash escape character is "neutralized" by the raw string, 
> you can use the usual RegEx syntax at leisure :
> 
> \<a[^>]+title="(.+?)"
> 
> \<   was a mistake on my part, a single < is perfectly enough
> [^>]  is a class definition, and the caret (^) character indicates 
> negation. Thus it means : any character other than >
> + indicates repetition : one or more of the previous element
> . will match just anything
> .+"   is a _greedy_ pattern that would match anything until it encountered 
> a double quote
> 
> The problem with a greedy pattern is that it doesn't stop at the first 
> match. To illustrate :
>  >>> a = re.search( r'".+"', 'title="this is a test" class="test"' )
>  >>> a.group()
> '"this is a test" class="test"'
> 
> It matches the first quote up to the last one.
> On the other hand, you can use the "?" modifier to specify a non-greedy 
> pattern :
> 
>  >>> b = re.search( r'".+?"', 'title="this is a test" class="test"' )
>  >>> b.group()
> '"this is a test"'
> 
> It matches the first quote and stops looking for further matches after 
> the second quote.
> 
> Finally, the parentheses are used to indicate a capture group :
>  >>> a = re.search( r'"this (is) a (.+?)"', 'title="this is a test" class="test"' )
>  >>> a.groups()
> ('is', 'test')
> 
> 
> You can find detailed explanations about Python regular expressions at 
> this page : https://docs.python.org/2/howto/regex.html
> 
> HTH,
> 
> -Grobu-



Wow!  Awesome!  I bookmarked that link!  
Thanks for everything!!!


Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Chris Angelico
On Thu, Nov 26, 2015 at 10:53 AM, Marko Rauhamaa  wrote:
> Regular expressions can handle any regular language just fine. They are
> commonly used to define the lexical tokens of a language.

Not sure about _defining_ them, but they're certainly often used to
_recognize_ them, eg in syntax highlighters.

ChrisA


Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Grobu

Chris, Marko, thank you both for your links and explanations!


Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread TP
On Wed, Nov 25, 2015 at 12:42 PM, ryguy7272  wrote:
> Hello experts.  I'm looking at this url:
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names

Wildly offtopic but interesting, easy way to grab/analyze Wikipedia
data using F# instead of Python
http://evelinag.com/blog/2015/11-18-f-tackles-james-bond/

In your particular case something like:

open FSharp.Data
let [<Literal>] wikiURL =
    "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
type PlaceNamesProvider = HtmlProvider<wikiURL>

let placeNamesWiki = PlaceNamesProvider()
for row in placeNamesWiki.Tables.``Short & medium length names``.Rows do
  printfn "%s" row.Column1


Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread ryguy7272
On Wednesday, November 25, 2015 at 3:42:21 PM UTC-5, ryguy7272 wrote:
> Hello experts.  I'm looking at this url:
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
> 
> I'm trying to figure out how to list all 'a title' elements.  For instance, I 
> see the following:
> Accident
>  href="/w/index.php?title=Ala-Lemu=edit=1">Ala-Lemu
> Alert
> Apocalypse Peaks
> 
> So, I tried putting a script together to get 'title'.  Here's my attempt.
> 
> import requests
> import sys
> from bs4 import BeautifulSoup
> 
> url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
> source_code = requests.get(url) 
> plain_text = source_code.text
> soup = BeautifulSoup(plain_text)
> for link in soup.findAll('title'):
>     print(link)
> 
> All that does is get the title of the page.  I tried to get the links from 
> that url, with this script.
> 
> import urllib2
> import re
> 
> #connect to a URL
> website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')
> 
> #read html code
> html = website.read()
> 
> #use re.findall to get all the links
> links = re.findall('"((http|ftp)s?://.*?)"', html)
> 
> print links
> 
> That doesn't work either.  Basically, I'd like to see this.
> 
> Accident
> Ala-Lemu
> Alert
> Apocalypse Peaks
> Athol
> Å
> Barbecue
> Båstad
> Bastardstown
> Batman
> Bathmen (Battem), Netherlands
> ...
> Worms
> Yell
> Zigzag
> Zzyzx
> 
> How can I do that?
> Thanks all!!



Ok, I guess that makes sense.  So, I just tried the script below, and got 
nothing...

import requests
from bs4 import BeautifulSoup

r = requests.get("https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names")
soup = BeautifulSoup(r.content)
print soup.find_all("a",{"title"})


Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Chris Angelico
On Thu, Nov 26, 2015 at 9:04 AM, ryguy7272  wrote:
> Ok, I guess that makes sense.  So, I just tried the script below, and got 
> nothing...
>
> import requests
> from bs4 import BeautifulSoup
>
> r = requests.get("https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names")
> soup = BeautifulSoup(r.content)
> print soup.find_all("a",{"title"})

The second argument to find_all is supposed to be a dict, not a set,
and it's only useful if you want to put some restriction on the
titles. To simply enumerate all the titles, try this:

[a.get("title") for a in soup.find_all("a")]
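
The set-versus-dict slip is easy to make because both literals use braces. Here is a minimal sketch in plain Python (Python 3 syntax, no bs4 needed); the anchors list is a made-up stand-in for the attribute dictionaries of three parsed 'a' tags, used only to show why the comprehension above produces None for anchors that have no title:

```python
# Braces around bare values build a set; braces around key: value pairs build a dict.
print(type({"title"}).__name__)        # set
print(type({"title": True}).__name__)  # dict

# Hypothetical stand-in for the attributes of three parsed <a> tags.
anchors = [
    {"href": "/wiki/Accident,_Maryland", "title": "Accident, Maryland"},
    {"href": "#top"},
    {"href": "/wiki/Zzyzx,_California", "title": "Zzyzx, California"},
]

# .get returns None when the key is missing, so untitled anchors show up as None.
all_titles = [a.get("title") for a in anchors]

# Dropping the None entries leaves only the anchors that carry a title.
titles = [t for t in all_titles if t is not None]

print(all_titles)  # ['Accident, Maryland', None, 'Zzyzx, California']
print(titles)      # ['Accident, Maryland', 'Zzyzx, California']
```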

ChrisA


Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread MRAB

On 2015-11-25 20:42, ryguy7272 wrote:

Hello experts.  I'm looking at this url:
https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names

I'm trying to figure out how to list all 'a title' elements.  For instance, I 
see the following:
Accident
Ala-Lemu
Alert
Apocalypse Peaks

So, I tried putting a script together to get 'title'.  Here's my attempt.

import requests
import sys
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll('title'):
 print(link)

All that does is get the title of the page.  I tried to get the links from that 
url, with this script.

A 'title' element has the form "<title ...>". What you should be looking 
for are 'a' elements, those of the form "<a ...>".



import urllib2
import re

#connect to a URL
website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')

#read html code
html = website.read()

#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)

print links

That doesn't work either.  Basically, I'd like to see this.

Accident
Ala-Lemu
Alert
Apocalypse Peaks
Athol
Å
Barbecue
Båstad
Bastardstown
Batman
Bathmen (Battem), Netherlands
...
Worms
Yell
Zigzag
Zzyzx

How can I do that?
Thanks all!!






Screen scraper to get all 'a title' elements

2015-11-25 Thread ryguy7272
Hello experts.  I'm looking at this url:
https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names

I'm trying to figure out how to list all 'a title' elements.  For instance, I 
see the following:
Accident
Ala-Lemu
Alert
Apocalypse Peaks

So, I tried putting a script together to get 'title'.  Here's my attempt.

import requests
import sys
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
source_code = requests.get(url) 
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll('title'):
    print(link)

All that does is get the title of the page.  I tried to get the links from that 
url, with this script.

import urllib2
import re

#connect to a URL
website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')

#read html code
html = website.read()

#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)

print links

That doesn't work either.  Basically, I'd like to see this.

Accident
Ala-Lemu
Alert
Apocalypse Peaks
Athol
Å
Barbecue
Båstad
Bastardstown
Batman
Bathmen (Battem), Netherlands
...
Worms
Yell
Zigzag
Zzyzx

How can I do that?
Thanks all!!

