Re: Screen scraper to get all 'a title' elements
On Wed, 25 Nov 2015 12:42:00 -0800, ryguy7272 wrote:

> Hello experts. I'm looking at this url:
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
>
> I'm trying to figure out how to list all 'a title' elements.

a is the element tag; title is an attribute of the HTMLAnchorElement.

Combining bs4 with Python structures allows you to find all the specified attributes of an element type. For example, to find the class attributes of all the paragraphs that have a class attribute:

    stuff = [p.attrs['class'] for p in soup.find_all('p') if 'class' in p.attrs]

Then you can do this (Python 2.7):

    for thing in stuff:
        print thing

This may be adaptable to your requirement.

-- 
Denis McMahon, denismfmcma...@gmail.com
-- 
https://mail.python.org/mailman/listinfo/python-list
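For what it's worth, here is one way the list comprehension above might be adapted from paragraph classes to anchor titles. It is only a sketch: it is written for Python 3 rather than 2.7, it assumes the third-party bs4 package is installed, and the HTML snippet and the title values in it are made up for illustration rather than copied from the Wikipedia page.

```python
# Sketch: the same comprehension, but over <a> tags and their title attribute.
# Assumption: bs4 is installed; sample_html is an invented stand-in for the page.
from bs4 import BeautifulSoup

sample_html = """
<ul>
  <li><a href="/wiki/Accident,_Maryland" title="Accident, Maryland">Accident</a></li>
  <li><a href="/wiki/Zzyzx,_California" title="Zzyzx, California">Zzyzx</a></li>
  <li><a href="#top">no title here</a></li>
</ul>
"""

# Naming the parser explicitly avoids depending on whichever one bs4 would pick.
soup = BeautifulSoup(sample_html, "html.parser")

# Keep only anchors that actually carry a title attribute.
titles = [a.attrs["title"] for a in soup.find_all("a") if "title" in a.attrs]

for title in titles:
    print(title)
```

The `if "title" in a.attrs` guard matters because plenty of anchors (in-page links, footnote markers) have no title at all.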
Re: Screen scraper to get all 'a title' elements
Hi

It seems that links on that Wikipedia page follow the structure :

    <a ... title="..." ...>

You could extract a list of link titles with something like :

    re.findall( r'\<a[^>]+title="(.+?)"', html )

HTH,

-Grobu-

On 25/11/15 21:55, MRAB wrote:
> On 2015-11-25 20:42, ryguy7272 wrote:
>> Hello experts. I'm looking at this url:
>> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
>>
>> I'm trying to figure out how to list all 'a title' elements.
>> [ ... ]
>> All that does is get the title of the page. I tried to get the links
>> from that url, with this script.
>
> A 'title' element has the form "<title ...>". What you should be looking
> for are 'a' elements, those of the form "<a ...>".
>
> [ ... ]
Re: Screen scraper to get all 'a title' elements
On Thu, Nov 26, 2015 at 10:37 AM, ryguy7272 wrote:
> Wow! Awesome! I bookmarked that link!
> Thanks for everything!!!

Also bookmark this link:

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

And read it before you do any parsing of HTML using regular expressions.

ChrisA
Re: Screen scraper to get all 'a title' elements
On 26/11/15 00:06, Chris Angelico wrote:
> On Thu, Nov 26, 2015 at 9:48 AM, ryguy7272 wrote:
>> Thanks!! Is that regex? Can you explain exactly what it is doing?
>> Also, it seems to pick up a lot more than just the list I wanted, but
>> that's ok, I can see why it does that.
>>
>> Can you just please explain what it's doing???
>
> It's a trap!
>
> Don't use a regex to parse HTML, unless you're deliberately trying to
> entice young and innocent programmers to the dark side.
>
> ChrisA

Sorry, I wasn't aware of regex being on the dark side :-)
Now that you mention it, I suppose that their being complex and error-inducing could lead to broken code all too easily when there is a reliable, ready-made solution like BeautifulSoup.
Re: Screen scraper to get all 'a title' elements
On Thu, Nov 26, 2015 at 10:44 AM, Grobu wrote:
> On 26/11/15 00:06, Chris Angelico wrote:
>> [ ... ]
>> Don't use a regex to parse HTML, unless you're deliberately trying to
>> entice young and innocent programmers to the dark side.
>
> Sorry, I wasn't aware of regex being on the dark side :-)
> Now that you mention it, I suppose that their being complex and
> error-inducing could lead to broken code all too easily when there is a
> reliable, ready-made solution like BeautifulSoup.

Regular expressions have their uses, but parsing HTML is not one of them. The most important use of a regex is letting an end user control the search pattern; it's a compact language for describing a variety of text search concepts.

For hard-coded regular expressions, there are some places where they're very good, and a lot of places where they're the wrong tool for the job. And one of those wrong-tool-for-job places is parsing stuff that fundamentally cannot be parsed with regexes, such as HTML. You _need_ a proper parser, which is what Beautiful Soup is for.

ChrisA
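To make the "proper parser" point concrete without any third-party package, the sketch below uses the standard library's html.parser module instead of Beautiful Soup (a substitution for illustration, not what was recommended above). The class name AnchorTitleCollector and the sample markup are invented for this example; it is written for Python 3.

```python
# Sketch: collect the title attribute of every <a> tag with a real HTML parser.
# Assumption: AnchorTitleCollector and sample_html are hypothetical names/data.
from html.parser import HTMLParser


class AnchorTitleCollector(HTMLParser):
    """Records the title attribute of each <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # The parser hands over the tag name in lower case and the attributes
        # as (name, value) pairs, already unquoted and entity-decoded.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "title" and value is not None:
                self.titles.append(value)


sample_html = (
    '<p title="not a link">intro</p>'
    "<a href='/wiki/Alert' title='Alert, Nunavut'>Alert</a>"
    '<a title="He said &quot;hi&quot;" href="/x">quoted</a>'
    '<a href="/y">untitled</a>'
)

collector = AnchorTitleCollector()
collector.feed(sample_html)
collector.close()
print(collector.titles)
```

The sample deliberately includes a single-quoted attribute and an escaped quote inside a value, which are the kinds of detail a hand-written pattern tends to miss and a parser handles as a matter of course.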
Re: Screen scraper to get all 'a title' elements
On Thu, Nov 26, 2015 at 9:48 AM, ryguy7272 wrote:
> Thanks!! Is that regex? Can you explain exactly what it is doing?
> Also, it seems to pick up a lot more than just the list I wanted, but that's
> ok, I can see why it does that.
>
> Can you just please explain what it's doing???

It's a trap!

Don't use a regex to parse HTML, unless you're deliberately trying to entice young and innocent programmers to the dark side.

ChrisA
Re: Screen scraper to get all 'a title' elements
On 25/11/15 23:48, ryguy7272 wrote:
>> re.findall( r'\<a[^>]+title="(.+?)"', html )
> [ ... ]
> Thanks!! Is that regex? Can you explain exactly what it is doing?
> Also, it seems to pick up a lot more than just the list I wanted, but
> that's ok, I can see why it does that.
>
> Can you just please explain what it's doing???

Yes, it's a regular expression. Because regexes use the backslash as an escape character, it is advisable to use the "raw string" prefix (r before the single/double/triple quote). To illustrate it with an example :

    >>> print "1\n2"
    1
    2
    >>> print r"1\n2"
    1\n2

As the backslash escape character is "neutralized" by the raw string, you can use the usual regex syntax at leisure :

    \<a[^>]+title="(.+?)"

\< was a mistake on my part; a single < is perfectly enough.
[^>] is a class definition, and the caret (^) character indicates negation. Thus it means : any character other than >
+ indicates repetition : one or more of the previous element
. will match any character (except a newline, by default)
.+" is a _greedy_ pattern that would match as much as it can before a double quote

The problem with a greedy pattern is that it doesn't stop at the first double quote it finds. To illustrate :

    >>> a = re.search( r'".+"', 'title="this is a test" class="test"' )
    >>> a.group()
    '"this is a test" class="test"'

It matches from the first quote up to the last one.
On the other hand, you can use the "?" modifier to specify a non-greedy pattern :

    >>> b = re.search( r'".+?"', 'title="this is a test" class="test"' )
    >>> b.group()
    '"this is a test"'

It matches from the first quote and stops at the second quote.

Finally, the parentheses are used to indicate a capture group :

    >>> a = re.search( r'"this (is) a (.+?)"', 'title="this is a test" class="test"' )
    >>> a.groups()
    ('is', 'test')

You can find detailed explanations about Python regular expressions at this page : https://docs.python.org/2/howto/regex.html

HTH,

-Grobu-
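The greedy versus non-greedy difference explained above can be checked on a small sample. This is a sketch in Python 3, assuming the pattern under discussion is `<a[^>]+title="(.+?)"` (written with a plain `<`); the two anchor tags in sample_html are invented for illustration and are not taken from the real page.

```python
# Sketch: run the non-greedy and greedy variants of the pattern on two anchors.
# Assumption: sample_html is made-up markup, kept on one line on purpose.
import re

sample_html = (
    '<a href="/wiki/Accident,_Maryland" title="Accident, Maryland">Accident</a> '
    '<a href="/wiki/Alert,_Nunavut" title="Alert, Nunavut" class="mw-redirect">Alert</a>'
)

# Non-greedy: (.+?) stops at the first double quote after title=".
lazy = re.findall(r'<a[^>]+title="(.+?)"', sample_html)

# Greedy: (.+) runs on to the last double quote in the text, swallowing
# the rest of the first tag and most of the second one.
greedy = re.findall(r'<a[^>]+title="(.+)"', sample_html)

print(lazy)
print(len(greedy), repr(greedy[0]))
```

Even the non-greedy version only works here because every title is double-quoted and each tag sits on one line, which is part of why the replies in this thread steer towards a parser.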
Re: Screen scraper to get all 'a title' elements
Grobu:

> Sorry, I wasn't aware of regex being on the dark side :-)

No, regular expressions are great for many purposes. Parsing context-free syntax isn't one of them. See:

https://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy

Most modern programming languages, and markup languages such as HTML, are context-free. Their structure is too rich for regular expressions to capture.

Regular expressions can handle any regular language just fine. They are commonly used to define the lexical tokens of a language.

Marko
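As a rough illustration of that division of labour, the sketch below uses one regular expression to recognise the tokens of a toy arithmetic notation, and ordinary code with a counter to check the nesting, which is the part a regular language cannot express. The token names, the grammar, and both function names are assumptions made up for this example (Python 3).

```python
# Sketch: regex for the lexical tokens, plain code for the nested structure.
# Assumption: the toy token set and the helper names are hypothetical.
import re

TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<OP>[-+*/()])
  | (?P<SKIP>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Split text into (kind, value) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError("unexpected character %r at position %d" % (text[pos], pos))
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parens_balanced(tokens):
    """Check nesting with a depth counter, i.e. with memory a regex lacks."""
    depth = 0
    for _kind, value in tokens:
        if value == "(":
            depth += 1
        elif value == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


print(tokenize("x + (42 * y)"))
print(parens_balanced(tokenize("(x))(")))
```

HTML has the same shape: individual tags and attributes look regular, but elements nest to arbitrary depth, so something has to keep track of the nesting.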
Re: Screen scraper to get all 'a title' elements
On Wednesday, November 25, 2015 at 5:30:14 PM UTC-5, Grobu wrote:
> Hi
>
> It seems that links on that Wikipedia page follow the structure :
> <a ... title="..." ...>
>
> You could extract a list of link titles with something like :
> re.findall( r'\<a[^>]+title="(.+?)"', html )
>
> HTH,
>
> -Grobu-
>
> On 25/11/15 21:55, MRAB wrote:
> > On 2015-11-25 20:42, ryguy7272 wrote:
> >> I'm trying to figure out how to list all 'a title' elements. For
> >> instance, I see the following:
> >> href="/wiki/Accident,_Maryland">Accident
> >> href="/w/index.php?title=Ala-Lemu=edit=1">Ala-Lemu
> >> Alert
> >> Apocalypse Peaks
> >> [ ... ]
> >> How can I do that?
> >> Thanks all!!

Thanks!! Is that regex? Can you explain exactly what it is doing?
Also, it seems to pick up a lot more than just the list I wanted, but that's ok, I can see why it does that.

Can you just please explain what it's doing???
Re: Screen scraper to get all 'a title' elements
On Wednesday, November 25, 2015 at 6:34:00 PM UTC-5, Grobu wrote:
> On 25/11/15 23:48, ryguy7272 wrote:
> > Can you just please explain what it's doing???
>
> Yes it's a regular expression.
> [ ... ]
> You can find detailed explanations about Python regular expressions at
> this page : https://docs.python.org/2/howto/regex.html
>
> HTH,
>
> -Grobu-

Wow! Awesome! I bookmarked that link!
Thanks for everything!!!
Re: Screen scraper to get all 'a title' elements
On Thu, Nov 26, 2015 at 10:53 AM, Marko Rauhamaa wrote:
> Regular expressions can handle any regular language just fine. They are
> commonly used to define the lexical tokens of a language.

Not sure about _defining_ them, but they're certainly often used to _recognize_ them, e.g. in syntax highlighters.

ChrisA
Re: Screen scraper to get all 'a title' elements
Chris, Marko, thank you both for your links and explanations!
Re: Screen scraper to get all 'a title' elements
On Wed, Nov 25, 2015 at 12:42 PM, ryguy7272 wrote:
> Hello experts. I'm looking at this url:
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names

Wildly off-topic but interesting: an easy way to grab and analyze Wikipedia data using F# instead of Python.

http://evelinag.com/blog/2015/11-18-f-tackles-james-bond/

In your particular case, something like:

    open FSharp.Data

    let [<Literal>] wikiURL = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
    type PlaceNamesProvider = HtmlProvider<wikiURL>
    let placeNamesWiki = PlaceNamesProvider()

    for row in placeNamesWiki.Tables.``Short & medium length names``.Rows do
        printfn "%s" row.Column1
Re: Screen scraper to get all 'a title' elements
On Wednesday, November 25, 2015 at 3:42:21 PM UTC-5, ryguy7272 wrote:
> Hello experts. I'm looking at this url:
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
>
> I'm trying to figure out how to list all 'a title' elements.
> [ ... ]
> How can I do that?
> Thanks all!!

Ok, I guess that makes sense. So, I just tried the script below, and got nothing...

    import requests
    from bs4 import BeautifulSoup

    r = requests.get("https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names")
    soup = BeautifulSoup(r.content)
    print soup.find_all("a",{"title"})
Re: Screen scraper to get all 'a title' elements
On Thu, Nov 26, 2015 at 9:04 AM, ryguy7272 wrote:
> Ok, I guess that makes sense. So, I just tried the script below, and got
> nothing...
>
> import requests
> from bs4 import BeautifulSoup
>
> r = requests.get("https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names")
> soup = BeautifulSoup(r.content)
> print soup.find_all("a",{"title"})

The second argument to find_all is supposed to be a dict, not a set, and it's only useful if you want to put some restriction on the titles. To simply enumerate all the titles, try this:

    [a.get("title") for a in soup.find_all("a")]

ChrisA
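A small sketch of the suggestion above, next to two variations, may help show what each call returns. It assumes Python 3 and an installed bs4 package; the markup and title values in sample_html are invented for illustration. Passing title=True as a keyword filter is, as far as I know, the documented Beautiful Soup way to ask for tags that have the attribute at all, whatever its value.

```python
# Sketch: three ways of asking Beautiful Soup about <a title="..."> attributes.
# Assumption: bs4 is installed; sample_html is an invented stand-in for the page.
from bs4 import BeautifulSoup

sample_html = """
<a href="/wiki/Batman,_Turkey" title="Batman, Turkey">Batman</a>
<a href="/wiki/Yell" title="Yell, Shetland">Yell</a>
<a href="#cite_note-1">[1]</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# As suggested above: every anchor, with None where there is no title.
all_titles = [a.get("title") for a in soup.find_all("a")]

# Only anchors that have a title attribute, whatever its value.
only_titled = [a["title"] for a in soup.find_all("a", title=True)]

# A dict as second argument restricts the match to a specific value.
restricted = [a["title"] for a in soup.find_all("a", {"title": "Yell, Shetland"})]

print(all_titles)
print(only_titled)
print(restricted)
```

Filtering out the None entries (or using the title=True form) is probably what is wanted for the place-name list, since footnote and navigation anchors usually carry no title.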
Re: Screen scraper to get all 'a title' elements
On 2015-11-25 20:42, ryguy7272 wrote:
> Hello experts. I'm looking at this url:
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
>
> I'm trying to figure out how to list all 'a title' elements.
> [ ... ]
> So, I tried putting a script together to get 'title'. Here's my attempt.
>
> import requests
> import sys
> from bs4 import BeautifulSoup
>
> url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
> source_code = requests.get(url)
> plain_text = source_code.text
> soup = BeautifulSoup(plain_text)
> for link in soup.findAll('title'):
>     print(link)
>
> All that does is get the title of the page. I tried to get the links
> from that url, with this script.

A 'title' element has the form "<title ...>". What you should be looking for are 'a' elements, those of the form "<a ...>".

> [ ... ]
> How can I do that?
> Thanks all!!
Screen scraper to get all 'a title' elements
Hello experts. I'm looking at this url:
https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names

I'm trying to figure out how to list all 'a title' elements. For instance, I see the following:

    Accident
    Ala-Lemu
    Alert
    Apocalypse Peaks

So, I tried putting a script together to get 'title'. Here's my attempt.

    import requests
    import sys
    from bs4 import BeautifulSoup

    url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for link in soup.findAll('title'):
        print(link)

All that does is get the title of the page. I tried to get the links from that url, with this script.

    import urllib2
    import re

    #connect to a URL
    website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')

    #read html code
    html = website.read()

    #use re.findall to get all the links
    links = re.findall('"((http|ftp)s?://.*?)"', html)

    print links

That doesn't work either. Basically, I'd like to see this.

    Accident
    Ala-Lemu
    Alert
    Apocalypse Peaks
    Athol
    Å
    Barbecue
    Båstad
    Bastardstown
    Batman
    Bathmen (Battem), Netherlands
    ...
    Worms
    Yell
    Zigzag
    Zzyzx

How can I do that?
Thanks all!!