Re: urllib behaves strangely
John J. Lee wrote:
> It looks like wikipedia checks the User-Agent header and refuses to
> send pages to browsers it doesn't like. Try: [...]
>
> If wikipedia is trying to discourage this kind of scraping, it's
> probably not polite to do it. (I don't know what wikipedia's policies
> are, though.)

They have a general policy against unapproved bots, which is understandable, since badly behaved bots could mess up or delete pages. If you read the policy, it is aimed at bots which modify Wikipedia articles automatically. http://en.wikipedia.org/wiki/Wikipedia:Bots says:

> This policy in a nutshell: Programs that update pages automatically in
> a useful and harmless way may be welcome if their owners seek approval
> first and go to great lengths to stop them running amok or being a
> drain on resources.

On the other hand, something which simply retrieves one or two fixed pages doesn't fit that definition of a bot, so it is probably all right. They even provide a link to some frameworks for writing bots, e.g. http://sourceforge.net/projects/pywikipediabot/

-- 
http://mail.python.org/mailman/listinfo/python-list
Re: urllib behaves strangely
> On the other hand, something which simply retrieves one or two fixed
> pages doesn't fit that definition of a bot, so it is probably all right.

I think so, too.

> They even provide a link to some frameworks for writing bots, e.g.
> http://sourceforge.net/projects/pywikipediabot/

Ah, that looks nice.

Best regards,
Gabriel.

-- 
/-----------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it?            |
| (Picasso)                               |
\-----------------------------------------/
Re: urllib behaves strangely
> headers = {}
> headers['User-Agent'] = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4'
> request = urllib2.Request(url, headers=headers)
> file = urllib2.urlopen(request)

Ah, thanks a lot, that works!

Best regards,
Gabriel.
urllib behaves strangely
Here is a very simple Python script utilizing urllib:

    import urllib

    url = 'http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological'
    print url
    file = urllib.urlopen( url )
    mime = file.info()
    print mime
    print file.read()
    print file.geturl()

However, when I execute it, I get an HTML error page (access denied). On the one hand, the funny thing is that I can view the page fine in my browser, and I can download it fine using curl. On the other hand, it must have something to do with the URL, because urllib works fine with any other URL I have tried ...

Any ideas? I would appreciate very much any hints or suggestions.

Best regards,
Gabriel.
Re: urllib behaves strangely
Gabriel Zachmann wrote:
> Here is a very simple Python script utilizing urllib: [...]
>
> However, when I execute it, I get an HTML error (access denied). On
> the one hand, the funny thing is that I can view the page fine in my
> browser, and I can download it fine using curl. On the other hand, it
> must have something to do with the URL, because urllib works fine with
> any other URL I have tried ...

The ':' in '..Commons:Feat..' is not a legal character in this part of the URI and has to be %-quoted as '%3a'. Try the URI 'http://commons.wikimedia.org/wiki/Commons%3aFeatured_pictures/chronological'; perhaps urllib is stricter than your browsers (which are known to accept every b**t you feed into them, sometimes with very confusing results) and gets confused when it tries to parse the malformed URI.

-- 
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/
Re: urllib behaves strangely
Benjamin Niemann wrote:
> Gabriel Zachmann wrote:
>> Here is a very simple Python script utilizing urllib: [...]
>
> The ':' in '..Commons:Feat..' is not a legal character in this part of
> the URI and has to be %-quoted as '%3a'.

Oops, I was wrong... ':' *is* allowed in path segments. I should eat something, my vision is starting to get blurry...

> Try the URI
> 'http://commons.wikimedia.org/wiki/Commons%3aFeatured_pictures/chronological',

You may try this anyway...
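To make the back-and-forth above concrete: a quick check with the standard library confirms both halves of the correction. This sketch uses Python 3's urllib.parse module (the thread predates Python 3, where the same quote() function lived in the urllib module). Note that quote() does escape ':' by default even though RFC 3986 allows it in path segments; whether to escape it is a choice you make via the safe parameter.

```python
from urllib.parse import quote, unquote

path = "/wiki/Commons:Featured_pictures/chronological"

# By default, only '/' is in quote()'s safe set, so ':' is escaped:
print(quote(path))             # /wiki/Commons%3AFeatured_pictures/chronological

# RFC 3986 permits ':' inside path segments, so it can be declared safe:
print(quote(path, safe="/:"))  # /wiki/Commons:Featured_pictures/chronological

# Either spelling decodes back to the same path:
print(unquote(quote(path)))    # /wiki/Commons:Featured_pictures/chronological
```

So a server should treat the '%3a' and ':' forms identically, which is why percent-quoting the colon was worth trying even after the correction.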
Re: urllib behaves strangely
Gabriel Zachmann wrote:
> Here is a very simple Python script utilizing urllib: [...]
>
> However, when I execute it, I get an HTML error (access denied). [...]
> Any ideas? I would appreciate very much any hints or suggestions.

I think the problem might be with the Wikimedia Commons website itself, rather than with urllib. Wikipedia has a policy against unapproved bots: http://en.wikipedia.org/wiki/Wikipedia:Bots

It might be that Wikimedia Commons blocks bots that aren't approved, and might consider your program a bot. I've had a similar error message from www.wikipedia.org and no problems with a couple of other websites I've tried. Also, the HTML the program returns seems to be a standard ACCESS DENIED page. It might be worth asking at the Wikimedia Commons website, at least to eliminate this possibility.

John Hicken
Re: urllib behaves strangely
Gabriel Zachmann wrote:
> Here is a very simple Python script utilizing urllib: [...]
>
> However, when I execute it, I get an HTML error (access denied). On
> the one hand, the funny thing is that I can view the page fine in my
> browser, and I can download it fine using curl. On the other hand, it
> must have something to do with the URL, because urllib works fine with
> any other URL I have tried ...

It looks like wikipedia checks the User-Agent header and refuses to send pages to browsers it doesn't like. Try:

    import urllib2

    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4'
    request = urllib2.Request(url, headers=headers)
    file = urllib2.urlopen(request)
    ...

That (or code very like it) worked when I tried it.
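For readers on a current Python, the same workaround can be sketched with urllib.request (Python 3's merge of urllib2). One pitfall worth spelling out: headers must be passed by keyword, because Request's second positional argument is the POST data, not the header dict. Nothing is actually fetched below; the snippet only builds the Request object and inspects it.

```python
import urllib.request

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; "
                         "rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4"}

# headers= must be a keyword argument; Request(url, headers) would
# silently treat the dict as request data instead.
request = urllib.request.Request(url, headers=headers)

# urllib.request normalizes header names to capitalized form internally.
print(request.get_header("User-agent"))

# To actually fetch the page, you would then do:
#   with urllib.request.urlopen(request) as f:
#       body = f.read()
```

Whether this is polite to do is a separate question, as discussed elsewhere in the thread.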
Re: urllib behaves strangely
Duncan Booth [EMAIL PROTECTED] writes:
> Gabriel Zachmann wrote:
>> Here is a very simple Python script utilizing urllib: [...]
>>
>> However, when I execute it, I get an HTML error (access denied). [...]
>
> It looks like wikipedia checks the User-Agent header and refuses to
> send pages to browsers it doesn't like. Try: [...]

If wikipedia is trying to discourage this kind of scraping, it's probably not polite to do it. (I don't know what wikipedia's policies are, though.)

John