[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-10 Thread Eduardo A . Bustamante López

Eduardo A. Bustamante López added the comment:

Hi Senthil,

> I fail to see the bug here. The robotparser module is for reading and
> parsing the robots.txt file; the module responsible for fetching it
> could be urllib.

You're right, but robotparser's read() calls urllib.request.urlopen to
fetch the robots.txt file. If robotparser took a file object, or something like
that, instead of a URL, I wouldn't think of this as a bug, but it doesn't. The
default behaviour is for it to fetch the file itself, using urlopen.

Also, I'm aware that you shouldn't normally need to worry about setting a
specific user-agent to fetch the file. But that's not the case for Wikipedia:
in my case, Wikipedia returned 403 for the urllib user-agent. And since there's
no documented way of specifying a particular user-agent in robotparser, or of
feeding a file object to robotparser, I decided to report this.

Only after reading the source of 2.7.x and 3.x can one find workarounds for
that problem, since the documentation doesn't make clear how these versions
request the robots.txt file.
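
To illustrate the "fetch it yourself, then hand it to robotparser" idea, here
is a minimal sketch for Python 3. It is not part of the attached patch. It
relies on RobotFileParser.parse(), which takes an iterable of lines; the helper
names load_robots and parse_robots, the timeout value and the sample rules are
my own assumptions.

```python
import urllib.request
import urllib.robotparser


def parse_robots(text, url=""):
    """Build a RobotFileParser from robots.txt text that was already fetched."""
    parser = urllib.robotparser.RobotFileParser(url)
    # parse() accepts the file's lines, so no network access happens here.
    parser.parse(text.splitlines())
    return parser


def load_robots(url, user_agent, timeout=10):
    """Fetch robots.txt with an explicit User-Agent header, then parse it.

    Hypothetical helper: only this one request carries the custom header,
    so other urlopen() callers are unaffected.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        text = response.read().decode("utf-8", errors="replace")
    return parse_robots(text, url)


# Offline usage with made-up rules (no request is sent):
rules = "User-agent: *\nDisallow: /private/\n"
rp = parse_robots(rules)
print(rp.can_fetch("MyUa/0.1", "http://example.com/private/page"))
print(rp.can_fetch("MyUa/0.1", "http://example.com/index.html"))
```

Unlike read(), this sketch lets an HTTP error such as the 403 propagate to the
caller, who then has to decide what that means for crawling.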

--

___
Python tracker 
<http://bugs.python.org/issue15851>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-09 Thread Eduardo A . Bustamante López

Eduardo A. Bustamante López added the comment:

I forgot to mention that I ran an nc process in parallel, to see what data is
being sent: ``nc -l -p ``.

--




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-09 Thread Eduardo A . Bustamante López

Eduardo A. Bustamante López added the comment:

I'm not sure what the best approach is here.

1. Avoid changes in the Lib, and document a workaround, which involves
   installing an opener with the specific User-agent. The drawback is that it
   modifies the behaviour of urlopen() globally, so the change affects every
   other call to urllib.request.urlopen.

2. Revert to the old way, using an instance of a FancyURLopener (or URLopener)
   in the RobotFileParser class. This requires a modification of the Lib, but
   allows us to modify only the behaviour of that specific instance of
   RobotFileParser. The user could subclass FancyURLopener and set the
   appropriate version string.

I attach a script, tested against the ``default`` branch of the mercurial
repository. It shows the workaround for Python 3.3.
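
The per-instance behaviour that option 2 is after can also be sketched from the
user's side, without touching the Lib. The following is only an illustration
under my own assumptions: it uses urllib.request.Request instead of a
FancyURLopener, the class name UserAgentRobotFileParser and its user_agent
parameter are hypothetical, and the error handling only roughly mirrors what
the stock read() does.

```python
import urllib.error
import urllib.request
import urllib.robotparser


class UserAgentRobotFileParser(urllib.robotparser.RobotFileParser):
    """Hypothetical subclass: sends a custom User-Agent for this instance only."""

    def __init__(self, url="", user_agent="MyUa/0.1"):
        super().__init__(url)
        self.user_agent = user_agent

    def read(self):
        # Attach the header to this single request; no global opener is installed.
        request = urllib.request.Request(
            self.url, headers={"User-Agent": self.user_agent}
        )
        try:
            response = urllib.request.urlopen(request)
        except urllib.error.HTTPError as err:
            # Assumption: treat auth errors as "disallow all" and other
            # 4xx responses as "allow all".
            if err.code in (401, 403):
                self.disallow_all = True
            elif 400 <= err.code < 500:
                self.allow_all = True
        else:
            with response:
                raw = response.read()
            self.parse(raw.decode("utf-8").splitlines())
```

Usage would be the same as for the standard class: create the instance with
the robots.txt URL, call read(), then call can_fetch().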

--
Added file: http://bugs.python.org/file27158/test.py

import urllib.request
import urllib.robotparser

# Build an opener that sends a custom User-agent header.
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'MyUa/0.1')]
# Install it globally: every later urlopen() call uses this opener.
urllib.request.install_opener(opener)

# RobotFileParser.read() calls urlopen(), so it now sends 'MyUa/0.1'.
rp = urllib.robotparser.RobotFileParser('http://localhost:')
rp.read()



[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-02 Thread Eduardo A . Bustamante López

Eduardo A. Bustamante López added the comment:

I guess a workaround is to do:

robotparser.URLopener.version = 'MyVersion'

--




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-02 Thread Eduardo A . Bustamante López

Changes by Eduardo A. Bustamante López:


Added file: http://bugs.python.org/file27101/myrobotparser.py




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-02 Thread Eduardo A . Bustamante López

New submission from Eduardo A. Bustamante López:

I found that http://en.wikipedia.org/robots.txt returns 403 if the provided 
user agent is in a specific blacklist.

And since robotparser doesn't provide a mechanism to change the default user 
agent used by the opener, it becomes unusable for that site (and sites that 
have a similar policy).

I think the user should have the possibility to set a specific user agent 
string, to better identify their bot.

I attach a patch that allows the user to change the opener used by
RobotFileParser, in case the need for some specific behavior arises.

I also attach a simple example of how it solves the issue, at least with
Wikipedia.

--
components: Library (Lib)
files: robotparser.py.diff
keywords: patch
messages: 169718
nosy: Eduardo.A..Bustamante.López
priority: normal
severity: normal
status: open
title: Lib/robotparser.py doesn't accept setting a user agent string, instead 
it uses the default.
type: enhancement
versions: Python 2.7
Added file: http://bugs.python.org/file27100/robotparser.py.diff
