[issue17403] Robotparser fails to parse some robots.txt

2015-03-12 Thread Berker Peksag
Berker Peksag added the comment: Yes, this doesn't look like a security issue to me. Too late for 3.2. Closing this as "fixed". -- nosy: +berker.peksag resolution: -> fixed stage: patch review -> resolved status: open -> closed versions: -Python 3.2

[issue17403] Robotparser fails to parse some robots.txt

2015-03-12 Thread Martin Panter
Martin Panter added the comment: Perhaps it’s too late to modify the 3.2 branch now? IMO the change made for this bug abuses the behaviour of urlunparse() removing empty query strings; see Issue 22852 where I proposed to stop it doing that. -- nosy: +vadmium
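
The urlunparse() behaviour mentioned here can be seen with a short round trip. This is only a minimal sketch, assuming the default behaviour of urllib.parse; the helper name normalize is hypothetical, and the path /catalogs/p? is borrowed from the original report.

```python
from urllib.parse import urlparse, urlunparse


def normalize(path):
    # Hypothetical helper: round-trip a path through urlparse()/urlunparse().
    # By default an empty query ("?" with nothing after it) is not preserved.
    return urlunparse(urlparse(path))


empty_query = normalize("/catalogs/p?")
real_query = normalize("/catalogs/p?id=1")

print(empty_query)  # trailing "?" is dropped
print(real_query)   # a non-empty query survives unchanged
```

Because the trailing "?" disappears, a rule path and a request path that differ only by an empty query end up identical after this round trip.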

[issue17403] Robotparser fails to parse some robots.txt

2013-05-29 Thread Senthil Kumaran
Senthil Kumaran added the comment: This is fixed in default, 3.3 and 2.7. I will merge this change into the 3.2 code line before closing this. I shall raise a new request for updating robotparser with other goodies.

[issue17403] Robotparser fails to parse some robots.txt

2013-05-29 Thread Roundup Robot
Roundup Robot added the comment: New changeset 30128355f53b by Senthil Kumaran in branch '3.3': #17403: urllib.parse.robotparser normalizes the urls before adding to ruleline. http://hg.python.org/cpython/rev/30128355f53b New changeset e954d7a3bb8a by Senthil Kumaran in branch 'default': merge f

[issue17403] Robotparser fails to parse some robots.txt

2013-04-22 Thread Senthil Kumaran
Senthil Kumaran added the comment: My suggestion for this issue is to go ahead with Mher's patch2. It does a simple normalization and does the right thing. The case in question is an empty query string and the behavior of Allow and Disallow for it, and the patch addresses that. (I don't know wh
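
The empty-query case can be exercised against urllib.robotparser directly. This is a rough sketch, assuming a Python version that already includes the fix; the robots.txt content and the helper name build_parser are made up for illustration and are not taken from Google's file.

```python
import urllib.robotparser

# Illustrative rules only: each path ends in an empty query string.
ROBOTS_TXT = """\
User-agent: *
Allow: /some/path?
Disallow: /another/path?
"""


def build_parser(text):
    # Hypothetical helper: feed robots.txt text to the parser without any
    # network access.
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("*", "/some/path?")
blocked = parser.can_fetch("*", "/another/path?")

print(allowed, blocked)
```

With the rule paths normalized in the same way as the requested URL, the Allow line applies to the first path and the Disallow line to the second.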

[issue17403] Robotparser fails to parse some robots.txt

2013-04-22 Thread Łukasz Langa
Łukasz Langa added the comment: robotparser implements http://www.robotstxt.org/orig.html, there's even a link to this document at http://docs.python.org/3/library/urllib.robotparser.html. As mher points out, there's a newer version of that spec formed as RFC: http://www.robotstxt.org/norobots

[issue17403] Robotparser fails to parse some robots.txt

2013-04-22 Thread R. David Murray
R. David Murray added the comment: I haven't a clue, that was part of the research I was going to do but haven't done yet (and probably won't for now...I'll wait to see if you or Łukasz pick it up first :). I see he didn't nosy himself on the issue yet, though, so I've done that. Maybe he'll

[issue17403] Robotparser fails to parse some robots.txt

2013-04-22 Thread Mher Movsisyan
Mher Movsisyan added the comment: Can you share a link to the new robots.txt standard? I may be able to help implement it.

[issue17403] Robotparser fails to parse some robots.txt

2013-04-21 Thread R. David Murray
Changes by R. David Murray: -- keywords: -easy

[issue17403] Robotparser fails to parse some robots.txt

2013-04-21 Thread R. David Murray
R. David Murray added the comment: Łukasz pointed out on IRC that the problem is that the current robotparser implements an outdated robots.txt standard. He may work on fixing that.

[issue17403] Robotparser fails to parse some robots.txt

2013-03-26 Thread R. David Murray
R. David Murray added the comment: Well, the code is easy. Figuring out what the code is supposed to do turns out to be hard, but we didn't know that when we marked it as easy :) I want to do more research before OKing a fix for this. (There is clearly a bug, I'm just not certain what the

[issue17403] Robotparser fails to parse some robots.txt

2013-03-26 Thread andrew cooke
andrew cooke added the comment: thanks (only subscribed to this now, so no previous email). my guess is that google are assuming a dumb regexp so http://example.com/foo? in a rule does not match http://example.com/foo and also i realised that http://google.com/robots.txt doesn't contai
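
One reading of that guess is plain literal prefix matching, where the "?" in a rule is compared as an ordinary character. The sketch below only illustrates that assumption; rule_matches is a hypothetical function, not robotparser's or Google's actual matcher.

```python
def rule_matches(rule_path, url_path):
    # Hypothetical literal matcher: a rule applies when the URL path starts
    # with the rule text exactly as written, "?" included.
    return url_path.startswith(rule_path)


print(rule_matches("/foo?", "/foo"))         # rule has "?", URL does not
print(rule_matches("/foo?", "/foo?page=2"))  # both contain "?"
print(rule_matches("/foo", "/foo?page=2"))   # shorter rule is a prefix
```

Under this reading, a rule for /foo? says nothing about /foo, which is why stripping the empty query from only one side changes the outcome.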

[issue17403] Robotparser fails to parse some robots.txt

2013-03-26 Thread Ezio Melotti
Ezio Melotti added the comment: Rietveld is the review tool. You can access it by clicking on the "review" link at the right of the patch. You should have received an email as well when I made the review.

[issue17403] Robotparser fails to parse some robots.txt

2013-03-26 Thread andrew cooke
andrew cooke added the comment: what is rietveld? and why is this marked as "easy"? it seems like it involves issues that aren't described well in the spec - it requires some kind of canonical way to describe urls with (and without) parameters to solve completely. -- nosy: +acooke

[issue17403] Robotparser fails to parse some robots.txt

2013-03-19 Thread Ezio Melotti
Ezio Melotti added the comment: I left a couple of comments on rietveld. -- stage: test needed -> patch review versions: +Python 3.2, Python 3.3, Python 3.4

[issue17403] Robotparser fails to parse some robots.txt

2013-03-19 Thread Mher Movsisyan
Mher Movsisyan added the comment: The second patch only normalizes the url. From http://www.robotstxt.org/norobots-rfc.txt it is not clear how to handle multiple rules with the same prefix. -- Added file: http://bugs.python.org/file29476/parser2.patch

[issue17403] Robotparser fails to parse some robots.txt

2013-03-18 Thread Mher Movsisyan
Mher Movsisyan added the comment: Attaching patch. -- keywords: +patch nosy: +mher Added file: http://bugs.python.org/file29457/parser.patch

[issue17403] Robotparser fails to parse some robots.txt

2013-03-14 Thread Ezio Melotti
Changes by Ezio Melotti: -- keywords: +easy nosy: +ezio.melotti stage: -> test needed

[issue17403] Robotparser fails to parse some robots.txt

2013-03-12 Thread Ben Mezger
New submission from Ben Mezger: I am trying to parse Google's robots.txt (http://google.com/robots.txt) and it fails when checking whether I can crawl the url /catalogs/p? (which is allowed): it returns False, according to my question on stackoverflow -> http://stackoverflow.com/quest