[issue39187] urllib.robotparser does not respect the longest match for the rule

2022-04-07 Thread Andre Burgaud


Andre Burgaud  added the comment:

Hi Matele,

Thanks for looking into this issue.

I have indeed seen some implementations that were based on the Python 
implementation and had the same problems, the Crystal implementation in 
particular (as far as I remember; it was a while ago). As a reference, I 
used the Google implementation, https://github.com/google/robotstxt, which 
respects the internet draft 
https://datatracker.ietf.org/doc/html/draft-koster-rep-00.

The two main points are described in section 2.2.2 
(https://datatracker.ietf.org/doc/html/draft-koster-rep-00#section-2.2.2), 
especially in the following paragraph:

   "To evaluate if access to a URI is allowed, a robot MUST match the
   paths in allow and disallow rules against the URI.  The matching
   SHOULD be case sensitive.  The most specific match found MUST be
   used.  The most specific match is the match that has the most octets.
   If an allow and disallow rule is equivalent, the allow SHOULD be
   used."

1) The most specific match found MUST be used.  The most specific match is the 
match that has the most octets.
2) If an allow and disallow rule is equivalent, the allow SHOULD be used.

In the robots.txt example you provided, the longest rule is Allow: 
/wp-admin/admin-ajax.php. It therefore takes precedence over the shorter 
Disallow rule, so fetching the sub-path admin-ajax.php should be allowed. To 
achieve that, the sort of the rules should list the Allow rule first.
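
As a rough illustration (a hypothetical sketch, not the actual change in the 
pull request), ordering the parsed rules by descending path length, with 
Allow before Disallow on ties, covers both points of the draft:

# Hypothetical sketch: the longest (most specific) path wins, and an Allow
# rule wins over an equivalent Disallow rule. Names are illustrative and are
# not the ones used in Lib/urllib/robotparser.py.
rules = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

def rule_key(rule):
    kind, path = rule
    # Sort key: path length first; on equal length, Allow (True) outranks
    # Disallow (False).
    return (len(path), kind == "Allow")

rules.sort(key=rule_key, reverse=True)
print(rules)
# [('Allow', '/wp-admin/admin-ajax.php'), ('Disallow', '/wp-admin/')]

With rules ordered this way, the first rule whose path matches the URI is 
the most specific one, which is the behavior the draft requires.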

I'm currently traveling, so I'm sorry if my explanations sound a bit limited. 
If it helps, I can pick up this discussion when I'm back home, after 
mid-April. In particular, I can run new tests with Python 3.10, since I 
raised this potential problem a bit more than two years ago and may need to 
refresh my memory :-)

In the meantime, let me know if there is anything I could provide to give a 
clearer background. For example, are you referring to the 2 issues I 
highlighted above, or is it something else that you are thinking about? Also, 
could you point me to the other robots checkers that you looked at?

Thanks!

Andre

--

Python tracker <https://bugs.python.org/issue39187>



[issue35457] robotparser reads empty robots.txt file as "all denied"

2020-01-02 Thread Andre Burgaud


Andre Burgaud  added the comment:

Thanks, @xtreak, for providing some clarification on this behavior! I can 
write some tests to cover it, assuming we agree that an empty file means 
"unlimited access". That is how it was worded in the old internet draft from 
1996 (section 3.2.1 in https://www.robotstxt.org/norobots-rfc.txt). The 
current draft is more ambiguous: "If no group satisfies either condition, or 
no groups are present at all, no rules apply." 
(https://tools.ietf.org/html/draft-koster-rep-00#section-2.2.1)

https://www.robotstxt.org/robotstxt.html clearly states that an empty file 
gives full access, but I'm getting lost trying to figure out which spec is 
the official one at the moment :-)
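
Such a test could look like the following sketch (hypothetical test and 
class names; not necessarily what would land in Lib/test/test_robotparser.py):

# Hypothetical sketch of a test asserting that an empty robots.txt means
# unlimited access; names are illustrative only.
import unittest
from urllib import robotparser

class EmptyRobotsTxtTest(unittest.TestCase):
    def test_empty_robots_allows_everything(self):
        rp = robotparser.RobotFileParser()
        rp.parse([])  # an empty robots.txt file has no lines at all
        self.assertTrue(rp.can_fetch("*", "/"))
        self.assertTrue(rp.can_fetch("*", "/admin"))

if __name__ == "__main__":
    unittest.main()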

--

Python tracker <https://bugs.python.org/issue35457>



[issue39187] urllib.robotparser does not respect the longest match for the rule

2020-01-01 Thread Andre Burgaud


Andre Burgaud  added the comment:

During testing, I identified a related issue that is fixed by the same sort 
function implemented to address the longest-match rule.

This related problem, also addressed by this change, concerns the situation 
where two equivalent rules (the same path for allow and disallow) are found 
in the robots.txt. In such a situation, allow should be used: 
https://tools.ietf.org/html/draft-koster-rep-00#section-2.2.2
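
A hypothetical illustration of that case (not taken from the pull request):

# Per the draft, the Allow rule should win over an equivalent Disallow rule.
# The unpatched parser applies the rules in file order, so it returns False
# here; with the fix the expected result is True.
from urllib import robotparser

lines = [
    "User-agent: *",
    "Disallow: /page.html",
    "Allow: /page.html",
]

rp = robotparser.RobotFileParser()
rp.parse(lines)
print(rp.can_fetch("*", "/page.html"))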

--

Python tracker <https://bugs.python.org/issue39187>



[issue39187] urllib.robotparser does not respect the longest match for the rule

2020-01-01 Thread Andre Burgaud


Change by Andre Burgaud :


--
keywords: +patch
pull_requests: +17227
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/17794

Python tracker <https://bugs.python.org/issue39187>



[issue39187] urllib.robotparser does not respect the longest match for the rule

2020-01-01 Thread Andre Burgaud


New submission from Andre Burgaud :

As per the current Robots Exclusion Protocol internet draft, 
https://tools.ietf.org/html/draft-koster-rep-00#section-3.2, a robot should 
apply the rules respecting the longest match.

urllib.robotparser instead relies on the order of the rules in the robots.txt 
file. Here is the relevant section of the spec:

===
3.2.  Longest Match

   The following example shows that in the case of a two rules, the
   longest one MUST be used for matching.  In the following case,
   /example/page/disallowed.gif MUST be used for the URI
   example.com/example/page/disallow.gif .

   
   User-Agent : foobot
   Allow : /example/page/
   Disallow : /example/page/disallowed.gif

===

I'm attaching a simple test file, "test_robot.py".
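
A minimal sketch of what such a test might check (hypothetical; the actual 
attached test_robot.py may differ):

# Hypothetical sketch reproducing the longest-match example from the draft;
# not the contents of the attached test_robot.py.
from urllib import robotparser

lines = [
    "User-Agent: foobot",
    "Allow: /example/page/",
    "Disallow: /example/page/disallowed.gif",
]

rp = robotparser.RobotFileParser()
rp.parse(lines)
# Per section 3.2 of the draft the longer Disallow rule must win, so this is
# expected to be False; the current parser returns True because the shorter
# Allow rule appears first in the file.
print(rp.can_fetch("foobot", "http://example.com/example/page/disallowed.gif"))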

--
components: Library (Lib)
files: test_robot.py
messages: 359181
nosy: gallicrooster
priority: normal
severity: normal
status: open
title: urllib.robotparser does not respect the longest match for the rule
type: behavior
versions: Python 3.8
Added file: https://bugs.python.org/file48815/test_robot.py

Python tracker <https://bugs.python.org/issue39187>



[issue35457] robotparser reads empty robots.txt file as "all denied"

2020-01-01 Thread Andre Burgaud


Andre Burgaud  added the comment:

Hi,

Is this ticket still relevant for Python 3.8?

While running some tests with an empty robots.txt file, I realized that it 
was returning "ALLOWED" for any path, as per the current draft of the Robots 
Exclusion Protocol: 
https://tools.ietf.org/html/draft-koster-rep-00#section-2.2.1

Code:

from urllib import robotparser

robots_url = "file:///tmp/empty.txt"

rp = robotparser.RobotFileParser()
print(robots_url)
rp.set_url(robots_url)
rp.read()
print("fetch /", rp.can_fetch(useragent="*", url="/"))
print("fetch /admin", rp.can_fetch(useragent="*", url="/admin"))

Output:

$ cat /tmp/empty.txt
$ python -V
Python 3.8.1
$ python test_robot3.py
file:///tmp/empty.txt
fetch / True
fetch /admin True

--
nosy: +gallicrooster

Python tracker <https://bugs.python.org/issue35457>