Getting URL's

2006-05-19 Thread defcon8
How do I get all the URL's in a page?

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Getting URL's

2006-05-19 Thread Ju Hui
Use HTMLParser or a regular expression.
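
A minimal sketch of the regular-expression route, assuming Python 3 and that only quoted href attributes matter. The names HREF_RE and find_hrefs are made up for illustration. A pattern like this misses unquoted attribute values and will also match href text inside comments or scripts, so treat it as a quick hack rather than a parser.

```python
import re

# Matches href="..." or href='...'; the backreference \1 makes the
# closing quote agree with the opening one. Deliberately simple.
HREF_RE = re.compile(r'href\s*=\s*(["\'])(.*?)\1', re.IGNORECASE | re.DOTALL)


def find_hrefs(html_text):
    """Return every quoted href value found in html_text, in document order."""
    return [match.group(2) for match in HREF_RE.finditer(html_text)]


sample = (
    '<a href="/intl/en/about.html">About</a> '
    "<A HREF='http://example.com/'>Example</A>"
)
print(find_hrefs(sample))
```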



Re: Getting URL's

2006-05-19 Thread defcon8
Thanks



Re: Getting URL's

2006-05-19 Thread Paul McGuire
defcon8 [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
> How do I get all the URL's in a page?


pyparsing comes with a simple example that does this, too.

-- Paul
Download pyparsing at http://sourceforge.net/projects/pyparsing




Re: Getting URL's

2006-05-19 Thread softwindow
It is difficult to get all the URLs in a page.
You can use the sgmllib module to parse HTML files
and get the standard href attributes.
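
sgmllib was removed in Python 3, so the sketch below uses html.parser from the standard library instead, which follows the same event-driven style. It assumes Python 3, and the names LinkCollector and collect_links are made up for illustration. It collects only the href attribute of a tags, which is what "the standard href" is taken to mean here.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html_text):
    """Parse html_text and return the list of anchor hrefs in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


sample = (
    '<p><a href="/preferences?hl=en">Preferences</a> '
    '<A HREF="/services/">Business Solutions</A></p>'
)
print(collect_links(sample))
```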



Re: Getting URL's

2006-05-19 Thread Paul McGuire
softwindow [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]

> it is difficult to get all URL's in a page
<snip>

Is this really so hard?:

-
# Python 2 syntax (print statement, urllib.urlopen)
from pyparsing import SkipTo, makeHTMLTags
import urllib

# extract all <a> anchor tags - makeHTMLTags defines a
# fairly robust pair of match patterns, not just "<tag>,</tag>"
linkOpenTag, linkCloseTag = makeHTMLTags("a")
link = linkOpenTag + \
       SkipTo(linkCloseTag).setResultsName("body") + \
       linkCloseTag.suppress()

# read the HTML source from some random URL
serverListPage = urllib.urlopen("http://www.google.com")
htmlText = serverListPage.read()
serverListPage.close()

# use the link grammar to scan the HTML source
for toks, strt, end in link.scanString(htmlText):
    print toks.startA.href, "-", toks.body

-
Prints:
/url?sa=p&pref=ig&pval=2&q=http://www.google.com/ig%3Fhl%3Den - Personalized Home
https://www.google.com/accounts/Login?continue=http://www.google.com/&hl=en - Sign in
/imghp?hl=en&tab=wi&ie=UTF-8 - Images
http://groups.google.com/grphp?hl=en&tab=wg&ie=UTF-8 - Groups
http://news.google.com/nwshp?hl=en&tab=wn&ie=UTF-8 - News
http://froogle.google.com/frghp?hl=en&tab=wf&ie=UTF-8 - Froogle
/maphp?hl=en&tab=wl&ie=UTF-8 - Maps
/intl/en/options/ - more&nbsp;&raquo;
/advanced_search?hl=en - Advanced Search
/preferences?hl=en - Preferences
/language_tools?hl=en - Language Tools
/intl/en/ads/ - Advertising&nbsp;Programs
/services/ - Business Solutions
/intl/en/about.html - About Google
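
Several of the hrefs printed above are relative to the site root (for example /intl/en/about.html). A minimal sketch, assuming Python 3 and that the page was fetched from http://www.google.com/, of turning such values into absolute URLs with urljoin from the standard library. The names base_url, hrefs and absolute are made up for illustration, and the example.com entry is a placeholder for an href that is already absolute.

```python
from urllib.parse import urljoin

# The URL the page was fetched from (an assumption for this sketch).
base_url = "http://www.google.com/"

# A few hrefs in the style of the output above, plus one already-absolute URL.
hrefs = [
    "/intl/en/about.html",
    "/services/",
    "/preferences?hl=en",
    "http://example.com/page",
]

# urljoin resolves relative references against base_url and leaves
# absolute URLs unchanged.
absolute = [urljoin(base_url, href) for href in hrefs]

for url in absolute:
    print(url)
```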


-- Paul

