Why are the URLs in the first set truncated?

On Aug 21, 8:07 am, Stef Mientki <[email protected]> wrote:
> On 21-08-2010 14:46, mdipierro wrote:
>
> > what do you find that is strange?
>
> This is the result with the last letter removed, so all links should give an
> error, but they differ between the 2 methods, and some of them produce 200
> while they are definitely wrong:
>
> 404 500 http://127.0.0.1:8000/welcome/default/user/logi
> 404 500 http://127.0.0.1:8000/welcome/default/user/registe
> 404 500 http://127.0.0.1:8000/welcome/default/user/request_reset_passwor
> 200 500 http://127.0.0.1:8000/welcome/default
> 400 500 http://127.0.0.1:8000/welcome/default/inde
> 200 500 http://127.0.0.1:8000/admin/default/design/welcom
> 200 500 http://127.0.0.1:8000/admin/default/edit/welcome/controllers/default.p
> 200 500 http://127.0.0.1:8000/admin/default/edit/welcome/views/default/index.htm
> 200 500 http://127.0.0.1:8000/admin/default/edit/welcome/views/layout.htm
> 200 500 http://127.0.0.1:8000/admin/default/edit/welcome/static/base.cs
> 200 500 http://127.0.0.1:8000/admin/default/edit/welcome/models/db.p
> 200 500 http://127.0.0.1:8000/admin/default/edit/welcome/models/menu.p
> 400 500 http://127.0.0.1:8000/welcome/appadmin/inde
> 200 500 http://127.0.0.1:8000/admin/default/inde
> 400 400 http://127.0.0.1:8000/examples/default/inde
> 200 -1 http://web2py.co
> 400 400 http://web2py.com/boo
> 400 500 http://127.0.0.1:8000/welcome/default/inde
> 200 500 http://127.0.0.1:8000/welcome/default
> 200 500 http://127.0.0.1:8000/admin/default/peek/welcome/controllers/default.p
> 200 500 http://127.0.0.1:8000/admin/default/peek/welcome/views/default/index.htm
> 200 -1 http://www.web2py.co
>
> This is the normal result:
>
> 200 500 http://127.0.0.1:8000/welcome/default/user/login
> 200 500 http://127.0.0.1:8000/welcome/default/user/register
> 200 500 http://127.0.0.1:8000/welcome/default/user/request_reset_password
> 200 500 http://127.0.0.1:8000/welcome/default
> 200 500 http://127.0.0.1:8000/welcome/default/index
> 200 500 http://127.0.0.1:8000/admin/default/design/welcome
> 200 500 http://127.0.0.1:8000/admin/default/edit/welcome/controllers/default.py
> 200 500 http://127.0.0.1:8000/admin/default/edit/welcome/views/default/index....
> 200 500 http://127.0.0.1:8000/admin/default/edit/welcome/views/layout.html
> 200 500 http://127.0.0.1:8000/admin/default/edit/welcome/static/base.css
> 200 500 http://127.0.0.1:8000/admin/default/edit/welcome/models/db.py
> 200 500 http://127.0.0.1:8000/admin/default/edit/welcome/models/menu.py
> 200 500 http://127.0.0.1:8000/welcome/appadmin/index
> 200 500 http://127.0.0.1:8000/admin/default/index
> 200 200 http://127.0.0.1:8000/examples/default/index
> 200 200 http://web2py.com
> 200 500 http://web2py.com/book
> 200 500 http://127.0.0.1:8000/welcome/default/index
> 400 500 http://127.0.0.1:8000/welcome/default/index#
> 200 500 http://127.0.0.1:8000/admin/default/peek/welcome/controllers/default.py
> 200 500 http://127.0.0.1:8000/admin/default/peek/welcome/views/default/index....
> 200 200 http://www.web2py.com
>
> So when is a URL valid?
>
> thanks,
> Stef
>
> > On Aug 21, 7:32 am, Stef Mientki <[email protected]> wrote:
> >>> Graphical representation of links or pages that don't get linked to.
> >> I tried to test the links (with 2 algorithms, code below) in a generated
> >> webpage, but the results I get are very weird.
> >> Probably one of you knows a better way?
> >
> >> cheers,
> >> Stef
> >
> >> from BeautifulSoup import BeautifulSoup
> >> from urllib import urlopen
> >> from httplib import HTTP
> >> from urlparse import urlparse
> >
> >> def Check_URL_1 ( URL ) :
> >>   try:
> >>     fh = urlopen ( URL )
> >>     return fh.code == 200
> >>   except :
> >>     return False
> >
> >> def Check_URL_2 ( URL ) :
> >>   p = urlparse ( URL )
> >>   h = HTTP ( p[1] )
> >>   h.putrequest ( 'HEAD', p[2] )
> >>   h.endheaders()
> >>   if h.getreply()[0] == 200:
> >>     return True
> >>   else:
> >>     return False
> >
> >> def Verify_Links ( URL ) :
> >>   Parts = URL.split('/')
> >>   Site = '/'.join ( Parts [:3] )
> >>   Current = '/'.join ( Parts [:-1] )
> >
> >>   fh = urlopen ( URL )
> >>   lines = fh.read ()
> >>   fh.close()
> >
> >>   Soup = BeautifulSoup ( lines )
> >>   hrefs = lines = Soup.findAll ( 'a' )
> >
> >>   for href in hrefs :
> >>     href = href [ 'href' ] #[:-1] ## <== remove "#" to generate all errors
> >
> >>     if href.startswith ( '/' ) :
> >>       href = Site + href
> >>     elif href.startswith ('#' ) :
> >>       href = URL + href
> >>     elif href.startswith ( 'http' ) :
> >>       pass
> >>     else :
> >>       href = Current + href
> >
> >>     try:
> >>       fh = urllib.urlopen ( href )
> >>     except :
> >>       pass
> >>     print Check_URL_1 ( href ), Check_URL_2 ( href ), href
> >
> >> URL = 'http://127.0.0.1:8000/welcome/default/index'
> >> fh = Verify_Links ( URL )
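
For comparison, the href-resolution step in the quoted code (the startswith chain) can probably be left to the standard library. Below is a minimal sketch, not a tested replacement. It assumes Python 3, where urljoin lives in urllib.parse, whereas the quoted code is Python 2. The names resolve_href and status_of are made up for this example, and the -1 return value is just this sketch's own marker for "no response at all".

```python
from urllib.parse import urljoin, urldefrag
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError


def resolve_href(page_url, href):
    """Turn an href found on page_url into an absolute URL without its #fragment."""
    # urljoin handles '/abs/path', 'relative/path', '#frag' and full URLs alike.
    absolute = urljoin(page_url, href)
    # The fragment is never sent to the server, so drop it before checking.
    return urldefrag(absolute)[0]


def status_of(url, timeout=5):
    """Return the HTTP status code for url, or -1 if no response was received."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # 4xx and 5xx replies still carry a status code
    except (URLError, OSError):
        return -1  # connection refused, DNS failure, timeout, ...


# Resolution only; nothing here touches the network.
page = 'http://127.0.0.1:8000/welcome/default/index'
for href in ['/admin/default/index', '#top', 'user/login', 'http://web2py.com/book']:
    print(resolve_href(page, href))
```

Calling status_of(resolve_href(page, href)) for each link would then give one status code per URL. Returning the code rather than True/False makes it easier to see which requests failed and how.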

