[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2016-11-21 Thread Raymond Hettinger

Changes by Raymond Hettinger :


--
assignee: rhettinger -> 




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2016-11-21 Thread Mark Lawrence

Changes by Mark Lawrence :


--
nosy:  -BreamoreBoy




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2016-11-20 Thread Xiang Zhang

Changes by Xiang Zhang :


--
versions: +Python 3.7 -Python 3.5




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2014-06-22 Thread Mark Lawrence

Mark Lawrence added the comment:

The code given in msg183579 works perfectly in 3.4.1 and 3.5.0.  Is there 
anything left to fix here, whether in code or docs?

--
nosy: +BreamoreBoy




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2014-06-22 Thread karl

karl added the comment:

Mark,

The code uses urllib to demonstrate the issue with Wikipedia and other 
sites that block the Python-urllib user agent because it is used by many 
spam harvesters.

The proposal is to give the robotparser lib a way to set the user agent 
string for its requests.
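
For illustration only, here is a rough sketch of the kind of interface being 
requested. The class name and the user_agent argument are hypothetical and not 
taken from any attached patch; it simply re-implements read() so the fetch 
carries a caller-chosen User-Agent:

import urllib.error
import urllib.request
import urllib.robotparser

class UARobotFileParser(urllib.robotparser.RobotFileParser):
    def __init__(self, url='', user_agent='MyBot/1.0'):
        super().__init__(url)
        self.user_agent = user_agent

    def read(self):
        # Same logic as RobotFileParser.read(), but the request carries our
        # own User-Agent header instead of the Python-urllib default.
        req = urllib.request.Request(
            self.url, headers={'User-Agent': self.user_agent})
        try:
            f = urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
            elif 400 <= err.code < 500:
                self.allow_all = True
        else:
            self.parse(f.read().decode('utf-8').splitlines())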

--




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2014-06-22 Thread karl

karl added the comment:

Note that one of the proposals is simply to document, in
https://docs.python.org/3/library/urllib.robotparser.html,
the workaround given in msg169722 (available in 3.4+):

robotparser.URLopener.version = 'MyVersion'

--




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2014-06-22 Thread Mark Lawrence

Mark Lawrence added the comment:

c:\cpython\PCbuild>python_d.exe -V
Python 3.5.0a0

c:\cpython\PCbuild>type C:\Users\Mark\MyPython\mytest.py
#!/usr/bin/env python3
# -*- coding: latin-1 -*-

import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Python-urllib')]
fobj = opener.open('http://en.wikipedia.org/robots.txt')
print('Finished, no traceback here')

c:\cpython\PCbuild>python_d.exe C:\Users\Mark\MyPython\mytest.py
Finished, no traceback here

--




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2014-06-22 Thread Raymond Hettinger

Changes by Raymond Hettinger raymond.hettin...@gmail.com:


--
assignee:  -> rhettinger
nosy: +rhettinger
versions: +Python 3.5 -Python 3.4




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2014-06-22 Thread karl

karl added the comment:

→ python
Python 2.7.5 (default, Mar  9 2014, 22:15:05) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import robotparser
>>> rp = robotparser.RobotFileParser('http://somesite.test.site/robots.txt')
>>> rp.read()
>>>


Let's check the server logs:

127.0.0.1 - - [23/Jun/2014:08:44:37 +0900] "GET /robots.txt HTTP/1.0" 200 92 
"-" "Python-urllib/1.17"

By default, robotparser in 2.* was using the Python-urllib/1.17 user agent, 
which is traditionally blocked by many sysadmins. A solution has already been 
proposed above.

This is the proposed test for 3.4:

import urllib.robotparser
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'MyUa/0.1')]
urllib.request.install_opener(opener)
rp = urllib.robotparser.RobotFileParser('http://localhost:')
rp.read()


The issue is no longer about changing the lib, but just about documenting how 
to change the RobotFileParser default UA. We can change the title of this 
issue if it's confusing, or close it and open a new one for the documentation 
work, whichever is easier :)

Currently robotparser.py just inherits the default urllib user agent:
http://hg.python.org/cpython/file/7dc94337ef67/Lib/urllib/request.py#l364
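
For reference, that default header is set in OpenerDirector.__init__ in 
urllib/request.py, which builds a "Python-urllib/<version>" string from the 
module's __version__. A quick way to see the value robotparser will end up 
sending (a small sketch, not part of the original message):

import urllib.request

opener = urllib.request.build_opener()
# The default header list contains the User-agent that robotparser inherits,
# e.g. [('User-agent', 'Python-urllib/3.4')].
print(opener.addheaders)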

It's a common failure we encounter when using urllib in general, and 
robotparser is affected too.


As for Wikipedia, they fixed their server-side user agent sniffing and no 
longer filter python-urllib.

GET /robots.txt HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate, compress
Host: en.wikipedia.org
User-Agent: Python-urllib/1.17

HTTP/1.1 200 OK
Accept-Ranges: bytes
Age: 3161
Cache-control: s-maxage=3600, must-revalidate, max-age=0
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 5208
Content-Type: text/plain; charset=utf-8
Date: Sun, 22 Jun 2014 23:59:16 GMT
Last-modified: Tue, 26 Nov 2013 17:39:43 GMT
Server: Apache
Set-Cookie: GeoIP=JP:Tokyo:35.6850:139.7514:v4; Path=/; Domain=.wikipedia.org
Vary: X-Subdomain
Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
X-Article-ID: 19292575
X-Cache: cp1065 miss (0), cp4016 hit (1), cp4009 frontend hit (215)
X-Content-Type-Options: nosniff
X-Language: en
X-Site: wikipedia
X-Varnish: 2529666795, 2948866481 2948865637, 4134826198 4130750894


Many other sites still do. :)

--
versions: +Python 3.4 -Python 3.5




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2014-06-22 Thread Raymond Hettinger

Changes by Raymond Hettinger raymond.hettin...@gmail.com:


--
versions: +Python 3.5 -Python 3.4




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2013-03-11 Thread Tshepang Lekhonkhobe

Changes by Tshepang Lekhonkhobe tshep...@gmail.com:


--
nosy: +tshepang




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2013-03-05 Thread karl

karl added the comment:

Setting a user agent string should be possible.
My guess is that the default library has been used by an abusive client (by 
mistake or intent) and the Wikimedia project has decided to blacklist the 
client based on user-agent string sniffing.

The match is on anything which matches

"Python-urllib" in UserAgentString

See below:

>>> import urllib.request
>>> opener = urllib.request.build_opener()
>>> opener.addheaders = [('User-agent', 'Python-urllib')]
>>> fobj = opener.open('http://en.wikipedia.org/robots.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 479, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 591, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 517, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 451, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 599, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
>>> import urllib.request
>>> opener = urllib.request.build_opener()
>>> opener.addheaders = [('User-agent', 'Pythonurllib/3.3')]
>>> fobj = opener.open('http://en.wikipedia.org/robots.txt')
>>> fobj
<http.client.HTTPResponse object at 0x101275850>
>>> import urllib.request
>>> opener = urllib.request.build_opener()
>>> opener.addheaders = [('User-agent', 'Pyt-honurllib/3.3')]
>>> fobj = opener.open('http://en.wikipedia.org/robots.txt')
>>> import urllib.request
>>> opener = urllib.request.build_opener()
>>> opener.addheaders = [('User-agent', 'Python-urllib')]
>>> fobj = opener.open('http://en.wikipedia.org/robots.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 479, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 591, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 517, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 451, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 599, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
>>> import urllib.request
>>> opener = urllib.request.build_opener()
>>> opener.addheaders = [('User-agent', 'Python-urlli')]
>>> fobj = opener.open('http://en.wikipedia.org/robots.txt')
>>>

Being able to change the header might indeed be a good thing.

--
nosy: +karlcow




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-11 Thread Senthil Kumaran

Senthil Kumaran added the comment:

Hi Eduardo,

I tested further and do observe some very strange oddities.

On Mon, Sep 10, 2012 at 10:45 PM, Eduardo A. Bustamante López
rep...@bugs.python.org wrote:

> Also, I'm aware that you shouldn't normally worry about setting a specific
> user-agent to fetch the file. But that's not the case of Wikipedia. In my case,
> Wikipedia returned 403 for the urllib user-agent.

Yeah, this really surprised me. I would normally assume robots.txt to
be readable by any agent, but I think something odd is happening.

In 2.7, I do not see the problem because the implementation is:

import urllib

class URLOpener(urllib.FancyURLopener):
def __init__(self, *args):
urllib.FancyURLopener.__init__(self, *args)
self.errcode = 200

opener = URLOpener()
fobj = opener.open('http://en.wikipedia.org/robots.txt')
print opener.errcode

This will print 200 and everything is fine. It also shows that robots.txt is 
accessible.

In 3.3, the implementation is:

import urllib.request

try:
fobj = urllib.request.urlopen('http://en.wikipedia.org/robots.txt')
except urllib.error.HTTPError as err:
print(err.code)

This gives 403.  I would normally expect this to work without any issues.
But according to my analysis, when the User-agent is set to something
which has '-' in it, the server rejects it with 403.

Underneath, what the above code is doing is this:

import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Python-urllib/3.3')]
fobj = opener.open('http://en.wikipedia.org/robots.txt')
print(fobj.getcode())

This would give 403. In order to see it work, change the addheaders line to one of the following:

opener.addheaders = [('', '')]
opener.addheaders = [('User-agent', 'Pythonurllib/3.3')]
opener.addheaders = [('User-agent', 'KillerSpamBot')]

All should work (as expected).

So, the thing which surprises me is whether sending Python-urllib/3.3 is a
mistake for THAT server.
Is this a server oddity on Wikipedia's part? (I referred to the hg log
to see since when we have been sending Python-urllib/<version>, and it
seems it has been sent for a long time.)

I can't see how this should be fixed in urllib.

--




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-10 Thread Senthil Kumaran

Senthil Kumaran added the comment:

Hello Eduardo,

I fail to see the bug here. The robotparser module is for reading and
parsing the robots.txt file; the module responsible for fetching it
would be urllib. robots.txt is always available from the web server, and
you can download it by any means, even by using robotparser.read() with
the full URL to robots.txt. You do not need to set a user-agent to
read/fetch the robots.txt file. Once it is fetched, when you crawl the
site using your custom-written crawler or using urllib, you can honor
the User-Agent requirement by sending proper headers with your request.
That can be done using the urllib module itself, and there is
documentation on adding headers, I believe.
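
For example, a minimal sketch of that approach (the User-Agent value is just
a placeholder):

import urllib.request

# Send the request with an explicit User-Agent instead of the default.
req = urllib.request.Request('http://en.wikipedia.org/robots.txt',
                             headers={'User-Agent': 'MyCrawler/1.0'})
fobj = urllib.request.urlopen(req)
print(fobj.read().decode('utf-8'))
fobj.close()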

I think this is the way most folks would be (or I believe are) using it.
Am I missing something? If my above explanation is okay, then we can
close this bug as invalid.

Thanks,
Senthil

--




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-10 Thread Eduardo A . Bustamante López

Eduardo A. Bustamante López added the comment:

Hi Senthil,

> I fail to see the bug in here. Robotparser module is for reading and
> parsing the robot.txt file, the module responsible for fetching it
> could urllib.

You're right, but robotparser's read() calls urllib.request.urlopen to fetch
the robots.txt file. If robotparser took a file object, or something like
that, instead of a URL, I wouldn't think of this as a bug, but it doesn't. The
default behaviour is for it to fetch the file itself, using urlopen.

Also, I'm aware that you shouldn't normally worry about setting a specific
user-agent to fetch the file. But that's not the case with Wikipedia. In my
case, Wikipedia returned 403 for the urllib user-agent. And since there's no
documented way of specifying a particular user-agent in robotparser, or of
feeding a file object to robotparser, I decided to report this.

Only after reading the source of 2.7.x and 3.x can one find workarounds for
the problem, since the documentation doesn't make it clear how these classes
request the robots.txt file.

--




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-09 Thread Eduardo A . Bustamante López

Eduardo A. Bustamante López added the comment:

I'm not sure what the best approach is here.

1. Avoid changes in the Lib, and document a work-around, which involves
   installing an opener with the specific User-agent. The drawback is that it
   modifies the behaviour of urlopen() globally, so the change affects any
   other call to urllib.request.urlopen.

2. Revert to the old way, using an instance of a FancyURLopener (or URLopener)
   in the RobotFileParser class. This requires a modification of the Lib, but
   allows us to modify only the behaviour of that specific instance of
   RobotFileParser. The user could sub-class FancyURLopener and set the
   appropriate version string (a rough sketch of this option is below).

I attach a script, tested against the ``default`` branch of the mercurial
repository. It shows the workaround for Python 3.3.
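
For illustration, a rough sketch of what option 2 might look like; the class
names and the 'MyCrawler/1.0' string are hypothetical and not taken from an
actual patch, and error handling (401/403 -> disallow_all) is omitted:

import urllib.request
import urllib.robotparser

class RobotURLopener(urllib.request.FancyURLopener):
    # Sub-classing FancyURLopener only to set the version (User-Agent) string.
    version = 'MyCrawler/1.0'

class MyRobotFileParser(urllib.robotparser.RobotFileParser):
    def read(self):
        # Fetch robots.txt through a private opener instance so that only
        # this parser's requests carry the custom User-Agent.
        opener = RobotURLopener()
        f = opener.open(self.url)
        self.parse(f.read().decode('utf-8').splitlines())
        f.close()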

--
Added file: http://bugs.python.org/file27158/test.py

Contents of the attached test.py:

import urllib.robotparser
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'MyUa/0.1')]
urllib.request.install_opener(opener)
rp = urllib.robotparser.RobotFileParser('http://localhost:')
rp.read()



[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-09 Thread Eduardo A . Bustamante López

Eduardo A. Bustamante López added the comment:

I forgot to mention that I ran a nc process in parallel, to see what data is
being sent: ``nc -l -p ``.

--




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-08 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +ezio.melotti




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-07 Thread Terry J. Reedy

Terry J. Reedy added the comment:

Enhancements can only be targeted at 3.4, where robotparser is now 
urllib.robotparser.

I wonder if documenting the simple solution would be sufficient.

--
nosy: +orsenthil, terry.reedy
versions: +Python 3.4 -Python 2.7




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-07 Thread Terry J. Reedy

Terry J. Reedy added the comment:

In any case, a doc change *could* go in 2.7 and 3.3/2.

--




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-02 Thread Eduardo A . Bustamante López

New submission from Eduardo A. Bustamante López:

I found that http://en.wikipedia.org/robots.txt returns 403 if the provided 
user agent is in a specific blacklist.

And since robotparser doesn't provide a mechanism to change the default user 
agent used by the opener, it becomes unusable for that site (and sites that 
have a similar policy).

I think the user should be able to set a specific user agent string, to 
better identify their bot.

I attach a patch that allows the user to change the opener used by 
RobotFileParser, in case some specific behavior is needed.

I also attach a simple example of how it solves the issue, at least with 
Wikipedia.

--
components: Library (Lib)
files: robotparser.py.diff
keywords: patch
messages: 169718
nosy: Eduardo.A..Bustamante.López
priority: normal
severity: normal
status: open
title: Lib/robotparser.py doesn't accept setting a user agent string, instead 
it uses the default.
type: enhancement
versions: Python 2.7
Added file: http://bugs.python.org/file27100/robotparser.py.diff




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-02 Thread Eduardo A . Bustamante López

Changes by Eduardo A. Bustamante López dual...@gmail.com:


Added file: http://bugs.python.org/file27101/myrobotparser.py




[issue15851] Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

2012-09-02 Thread Eduardo A . Bustamante López

Eduardo A. Bustamante López added the comment:

I guess a workaround is to do:

robotparser.URLopener.version = 'MyVersion'
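
Spelled out (Python 2; the user agent string and URL below are just
examples):

import robotparser

# robotparser's module-level URLopener class supplies the User-Agent header,
# so overriding its version attribute changes what read() sends.
robotparser.URLopener.version = 'MyBot/1.0'

rp = robotparser.RobotFileParser('http://en.wikipedia.org/robots.txt')
rp.read()
print(rp.can_fetch('MyBot/1.0', 'http://en.wikipedia.org/wiki/Main_Page'))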

--
