[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.
Eduardo A. Bustamante López added the comment:

Hi Senthil,

> I fail to see the bug in here. Robotparser module is for reading and
> parsing the robot.txt file, the module responsible for fetching it
> could urllib.

You're right, but robotparser's read() calls urllib.request.urlopen to fetch the robots.txt file. If robotparser took a file object, or something like that, instead of a URL, I wouldn't think of this as a bug, but it doesn't: the default behaviour is for it to fetch the file itself, using urlopen.

Also, I'm aware that you normally shouldn't have to worry about setting a specific user agent just to fetch robots.txt. But that's not the case with Wikipedia: in my case, Wikipedia returned 403 for the urllib user agent. And since there's no documented way of specifying a particular user agent in robotparser, or of feeding it a file object, I decided to report this.

Only after reading the source of 2.7.x and 3.x can one find work-arounds for the problem, since the documentation doesn't really make clear how these modules request robots.txt.
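A minimal sketch of the behaviour described above, using plain urllib.request in Python 3; the 'MyUa/0.1' string is only an example, and the exact response from Wikipedia may of course vary:

import urllib.error
import urllib.request

url = 'http://en.wikipedia.org/robots.txt'

# Request sent with urllib's default User-Agent (Python-urllib/3.x).
try:
    resp = urllib.request.urlopen(url)
    print('default UA:', resp.getcode())
except urllib.error.HTTPError as e:
    print('default UA:', e.code)   # 403 in the case described above

# The same request with an explicit User-Agent header.
req = urllib.request.Request(url, headers={'User-Agent': 'MyUa/0.1'})
print('custom UA:', urllib.request.urlopen(req).getcode())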
[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.
Eduardo A. Bustamante López added the comment:

I forgot to mention that I ran an nc process in parallel, to see what data is being sent: ``nc -l -p ``.
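For illustration, with ``nc -l -p 8080`` listening in one terminal (8080 is an arbitrary port here, not necessarily the one used in the original report), pointing robotparser at it shows exactly which headers urllib sends:

import urllib.robotparser

# nc never answers, so read() blocks until the listener is closed;
# the point is only that nc prints the request it received, including
# the default "User-Agent: Python-urllib/3.x" header.
rp = urllib.robotparser.RobotFileParser('http://localhost:8080/robots.txt')
rp.read()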
[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.
Eduardo A. Bustamante López added the comment:

I'm not sure what's the best approach here:

1. Avoid changes in the Lib, and document a work-around, which involves installing an opener with the specific user agent. The drawback is that it modifies the behaviour of urlopen() globally, so the change affects any other call to urllib.request.urlopen.

2. Revert to the old way, using an instance of FancyURLopener (or URLopener) in the RobotFileParser class. This requires a modification of the Lib, but allows us to modify only the behaviour of that specific instance of RobotFileParser. The user could sub-class FancyURLopener and set the appropriate version string (see the sketch after this message).

I attach a script, tested against the ``default`` branch of the Mercurial repository. It shows the work-around for Python 3.3 (approach 1):

import urllib.robotparser
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'MyUa/0.1')]
urllib.request.install_opener(opener)

rp = urllib.robotparser.RobotFileParser('http://localhost:')
rp.read()

--
Added file: http://bugs.python.org/file27158/test.py
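A rough sketch of what option 2 already looks like from the user's side in Python 2.7, where RobotFileParser fetches through a module-level FancyURLopener subclass; this relies on the undocumented robotparser.URLopener class, and the MyOpener/MyUa names are only examples:

import robotparser

class MyOpener(robotparser.URLopener):
    # robotparser.URLopener derives from urllib.FancyURLopener;
    # ``version`` is the User-Agent header sent when fetching robots.txt.
    version = 'MyUa/0.1'

# read() instantiates the module-level URLopener name, so rebind it.
robotparser.URLopener = MyOpener

rp = robotparser.RobotFileParser('http://en.wikipedia.org/robots.txt')
rp.read()
print(rp.can_fetch('MyUa/0.1', 'http://en.wikipedia.org/wiki/Python'))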
[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.
Eduardo A. Bustamante López added the comment:

I guess a workaround is to do:

robotparser.URLopener.version = 'MyVersion'
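In context, that one-liner could be used like this (Python 2.7; the User-Agent string and URL are examples, and note that the change is process-wide):

import robotparser

# Affects every RobotFileParser created afterwards; set it before read().
robotparser.URLopener.version = 'MyUa/0.1'

rp = robotparser.RobotFileParser('http://en.wikipedia.org/robots.txt')
rp.read()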
[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.
Changes by Eduardo A. Bustamante López:

Added file: http://bugs.python.org/file27101/myrobotparser.py
[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.
New submission from Eduardo A. Bustamante López:

I found that http://en.wikipedia.org/robots.txt returns 403 if the provided user agent is on a specific blacklist. And since robotparser doesn't provide a mechanism to change the default user agent used by the opener, it becomes unusable for that site (and for sites with a similar policy).

I think the user should be able to set a specific user agent string, to better identify their bot.

I attach a patch that allows the user to change the opener used by RobotFileParser, in case the need for some specific behaviour arises. I also attach a simple example of how it solves the issue, at least with Wikipedia.

--
components: Library (Lib)
files: robotparser.py.diff
keywords: patch
messages: 169718
nosy: Eduardo.A..Bustamante.López
priority: normal
severity: normal
status: open
title: Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.
type: enhancement
versions: Python 2.7
Added file: http://bugs.python.org/file27100/robotparser.py.diff
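A minimal sketch of the symptom as reported (Python 2.7; the exact response from Wikipedia may of course vary, and 'MyBot/1.0' is only an example agent name):

import robotparser

rp = robotparser.RobotFileParser('http://en.wikipedia.org/robots.txt')
rp.read()  # fetched with the default urllib User-Agent, which Wikipedia rejects

# The 403 makes robotparser treat everything as disallowed, even though
# the real robots.txt would permit most fetching.
print(rp.can_fetch('MyBot/1.0', 'http://en.wikipedia.org/wiki/Python'))  # -> False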