Re: urllib behaves strangely

2006-06-13 Thread Duncan Booth
John J. Lee wrote:

 It looks like wikipedia checks the User-Agent header and refuses to
 send pages to browsers it doesn't like. Try:
 [...]
 
 If wikipedia is trying to discourage this kind of scraping, it's
 probably not polite to do it.  (I don't know what wikipedia's policies
 are, though)

They have a general policy against unapproved bots, which is
understandable since badly behaved bots could mess up or delete pages.
If you read the policy, it is aimed at bots which modify Wikipedia
articles automatically. 

http://en.wikipedia.org/wiki/Wikipedia:Bots says:
 This policy in a nutshell:
 Programs that update pages automatically in a useful and harmless way
 may be welcome if their owners seek approval first and go to great
 lengths to stop them running amok or being a drain on resources.

On the other hand, something which is simply retrieving one or two fixed
pages doesn't fit that definition of a bot, so it is probably alright. They
even provide a link to some frameworks for writing bots, e.g.

http://sourceforge.net/projects/pywikipediabot/
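
For a one-off retrieval like that, it is probably politest simply to say who
you are in the User-Agent header. A minimal sketch of that idea (the agent
string and contact address below are only made-up placeholders):

import urllib2

url = 'http://en.wikipedia.org/wiki/Wikipedia:Bots'
request = urllib2.Request(url, headers={
    # identify the script and give a way to reach its owner
    'User-Agent': 'MyFetchScript/0.1 (one-off page retrieval; contact: you@example.org)',
})
page = urllib2.urlopen(request).read()
print len(page), 'bytes retrieved'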

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: urllib behaves strangely

2006-06-13 Thread Gabriel Zachmann
 On the other hand something which is simply retrieving one or two fixed
 pages doesn't fit that definition of a bot so is probably alright. They 

i think so, too.

 even provide a link to some frameworks for writing bots e.g.
 
 http://sourceforge.net/projects/pywikipediabot/


ah, that looks nice ..

Best regards,
Gabriel.

-- 
/---\
| If you know exactly what you will do --   |
| why would you want to do it?  |
|   (Picasso)   |
\---/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: urllib behaves strangely

2006-06-13 Thread Gabriel Zachmann
 import urllib2

 headers = {}
 headers['User-Agent'] = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
                          'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4')

 request = urllib2.Request(url, headers=headers)
 file = urllib2.urlopen(request)


ah, thanks a lot, that works !

Best regards,
Gabriel.

-- 
/---\
| If you know exactly what you will do --   |
| why would you want to do it?  |
|   (Picasso)   |
\---/
-- 
http://mail.python.org/mailman/listinfo/python-list


urllib behaves strangely

2006-06-12 Thread Gabriel Zachmann
Here is a very simple Python script utilizing urllib:

 import urllib
 url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
 print url
 print
 file = urllib.urlopen( url )
 mime = file.info()
 print mime
 print file.read()
 print file.geturl()


However, when i execute it, i get an HTML error page (access denied).

On the one hand, the funny thing is that i can view the page fine in my 
browser, and i can download it fine using curl.

On the other hand, it must have something to do with the URL because urllib 
works fine with any other URL i have tried ...

Any ideas?
I would appreciate very much any hints or suggestions.

Best regards,
Gabriel.


-- 
/---\
| If you know exactly what you will do --   |
| why would you want to do it?  |
|   (Picasso)   |
\---/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: urllib behaves strangely

2006-06-12 Thread Benjamin Niemann
Gabriel Zachmann wrote:

 Here is a very simple Python script utilizing urllib:
 
  import urllib
  url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
  print url
  print
  file = urllib.urlopen( url )
  mime = file.info()
  print mime
  print file.read()
  print file.geturl()
 
 
 However, when i execute it, i get an html error (access denied).
 
 On the one hand, the funny thing though is that i can view the page fine
 in my browser, and i can download it fine using curl.
 
 On the other hand, it must have something to do with the URL because
 urllib works fine with any other URL i have tried ...
 
 Any ideas?
 I would appreciate very much any hints or suggestions.

The ':' in '..Commons:Feat..' is not a legal character in this part of the
URI and has to be %-quoted as '%3a'.
Try the URI
'http://commons.wikimedia.org/wiki/Commons%3aFeatured_pictures/chronological';
perhaps urllib is stricter than your browsers (which are known to accept
every b**t you feed into them, sometimes with very confusing results)
and gets confused when it tries to parse the malformed URI.
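
If you would rather %-quote the path programmatically than by hand, a minimal
sketch with urllib.quote (by default it leaves '/' alone but quotes ':' and
most other punctuation):

import urllib

path = "/wiki/Commons:Featured_pictures/chronological"
# ':' comes out as '%3A', the slashes are left untouched
print "http://commons.wikimedia.org" + urllib.quote(path)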

-- 
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: urllib behaves strangely

2006-06-12 Thread Benjamin Niemann
Benjamin Niemann wrote:

 Gabriel Zachmann wrote:
 
 Here is a very simple Python script utilizing urllib:
 
  import urllib
  url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
  print url
  print
  file = urllib.urlopen( url )
  mime = file.info()
  print mime
  print file.read()
  print file.geturl()
 
 
 However, when i execute it, i get an html error (access denied).
 
 On the one hand, the funny thing though is that i can view the page fine
 in my browser, and i can download it fine using curl.
 
 On the other hand, it must have something to do with the URL because
 urllib works fine with any other URL i have tried ...
 
 Any ideas?
 I would appreciate very much any hints or suggestions.
 
 The ':' in '..Commons:Feat..' is not a legal character in this part of the
 URI and has to be %-quoted as '%3a'.

Oops, I was wrong... ':' *is* allowed in path segments. I should eat
something, my vision starts to get blurry...

 Try the URI
 'http://commons.wikimedia.org/wiki/Commons%3aFeatured_pictures/chronological',

You may try this anyway...


-- 
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: urllib behaves strangely

2006-06-12 Thread John Hicken

Gabriel Zachmann wrote:

 Here is a very simple Python script utilizing urllib:

  import urllib
  url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
  print url
  print
  file = urllib.urlopen( url )
  mime = file.info()
  print mime
  print file.read()
  print file.geturl()


 However, when i execute it, i get an html error (access denied).

 On the one hand, the funny thing though is that i can view the page fine in my
 browser, and i can download it fine using curl.

 On the other hand, it must have something to do with the URL because urllib
 works fine with any other URL i have tried ...

 Any ideas?
 I would appreciate very much any hints or suggestions.

 Best regards,
 Gabriel.


 --
 /---\
 | If you know exactly what you will do --   |
 | why would you want to do it?  |
 |   (Picasso)   |
 \---/

I think the problem might be with the Wikimedia Commons website itself,
rather than urllib.  Wikipedia has a policy against unapproved bots:
http://en.wikipedia.org/wiki/Wikipedia:Bots

It might be that Wikimedia Commons blocks bots that aren't approved,
and might consider your program a bot.  I've had a similar error message
from www.wikipedia.org and no problems with a couple of other
websites I've tried.  Also, the HTML the program returns seems to be a
standard ACCESS DENIED page.

It might be worth asking at the Wikimedia Commons website, at least to
eliminate this possibility.
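
One rough way to confirm that is to issue the same GET at the HTTP level and
look at the status line with different User-Agent strings. A sketch (the two
agent strings below are just examples):

import httplib

def status_for(user_agent):
    # Fetch the page with an explicit User-Agent and return the status line.
    conn = httplib.HTTPConnection("commons.wikimedia.org")
    conn.request("GET", "/wiki/Commons:Featured_pictures/chronological",
                 headers={'User-Agent': user_agent})
    response = conn.getresponse()
    status = (response.status, response.reason)
    conn.close()
    return status

print status_for('Python-urllib/2.4')         # roughly what urllib sends
print status_for('Mozilla/5.0 (compatible)')  # a browser-like agent string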

John Hicken

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: urllib behaves strangely

2006-06-12 Thread Duncan Booth
Gabriel Zachmann wrote:

 Here is a very simple Python script utilizing urllib:
 
  import urllib
  url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
  print url
  print
  file = urllib.urlopen( url )
  mime = file.info()
  print mime
  print file.read()
  print file.geturl()
 
 
 However, when i execute it, i get an html error (access denied).
 
 On the one hand, the funny thing though is that i can view the page
 fine in my browser, and i can download it fine using curl.
 
 On the other hand, it must have something to do with the URL because
 urllib works fine with any other URL i have tried ...
 
It looks like wikipedia checks the User-Agent header and refuses to send 
pages to browsers it doesn't like. Try:

import urllib2

headers = {}
headers['User-Agent'] = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
                         'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4')

request = urllib2.Request(url, headers=headers)
file = urllib2.urlopen(request)
...

That (or code very like it) worked when I tried it.
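
If you would rather stay with plain urllib, roughly the same effect can be had
by overriding the opener's version string, which urllib uses as its User-Agent.
A sketch along those lines (the agent string is just an example):

import urllib

class MyOpener(urllib.FancyURLopener):
    # urllib sends self.version as the User-Agent header
    version = 'Mozilla/5.0 (compatible; my-fetch-script)'

urllib._urlopener = MyOpener()   # urllib.urlopen() will now use this opener

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
file = urllib.urlopen(url)
print file.info()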
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: urllib behaves strangely

2006-06-12 Thread John J. Lee
Duncan Booth [EMAIL PROTECTED] writes:

 Gabriel Zachmann wrote:
 
  Here is a very simple Python script utilizing urllib:
[...]
  http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological
   print url
   print
   file = urllib.urlopen( url )
[...]
  However, when i execute it, i get an html error (access denied).
  
  On the one hand, the funny thing though is that i can view the page
  fine in my browser, and i can download it fine using curl.
[...]
  
  On the other hand, it must have something to do with the URL because
  urllib works fine with any other URL i have tried ...
  
 It looks like wikipedia checks the User-Agent header and refuses to send 
 pages to browsers it doesn't like. Try:
[...]

If wikipedia is trying to discourage this kind of scraping, it's
probably not polite to do it.  (I don't know what wikipedia's policies
are, though)
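
One concrete thing to check is the site's robots.txt; the standard library's
robotparser module can do that. A small sketch (the user-agent name is just a
placeholder):

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://commons.wikimedia.org/robots.txt")
rp.read()

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
# True if robots.txt allows this user-agent to fetch the page
print rp.can_fetch("MyFetchScript", url)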


John
-- 
http://mail.python.org/mailman/listinfo/python-list