[issue4191] urlparse normalize URL path

2008-10-24 Thread monk.e.boy

New submission from monk.e.boy [EMAIL PROTECTED]:

Hi,

  The way urljoin works is a bit funky. Equivalent paths do not get
cleaned in a consistent way:


import urlparse
import posixpath

print urlparse.urljoin('http://www.example.com', '///')
print urlparse.urljoin('http://www.example.com/', '///')
print urlparse.urljoin('http://www.example.com///', '///')
print urlparse.urljoin('http://www.example.com///', '//')
print urlparse.urljoin('http://www.example.com///', '/')
print urlparse.urljoin('http://www.example.com///', '')
print
# the above should reduce down to:
print posixpath.normpath('///')
print
print urlparse.urljoin('http://www.example.com///', '.')
print urlparse.urljoin('http://www.example.com///', '/.')
print urlparse.urljoin('http://www.example.com///', './')
print urlparse.urljoin('http://www.example.com///', '/.')
print
print posixpath.normpath('/.')
print
print urlparse.urljoin('http://www.example.com///', '..')
print urlparse.urljoin('http://www.example.com', '/a/../a/')
print urlparse.urljoin('http://www.example.com', '../')
print urlparse.urljoin('http://www.example.com', 'a/../a/')
print urlparse.urljoin('http://www.example.com', 'a/../a/./')
print urlparse.urljoin('http://www.example.com/a/../a/', '../a/./../a/')
print urlparse.urljoin('http://www.example.com/a/../a/', '/../a/./../a/')

The results of the above code are:

http://www.example.com/
http://www.example.com/
http://www.example.com/
http://www.example.com///
http://www.example.com/
http://www.example.com///

/

http://www.example.com///
http://www.example.com/.
http://www.example.com///
http://www.example.com/.

/

http://www.example.com
http://www.example.com/.
http://www.example.com
http://www.example.com/.

http://www.example.com//
http://www.example.com/a/../a/
http://www.example.com/../
http://www.example.com/a/
http://www.example.com/a/
http://www.example.com/a/
http://www.example.com/../a/./../a/

Sometimes the path is cleaned, sometimes it is not. When it is cleaned,
the cleaning process is not perfect.

The bit of code that is causing problems carries this comment:

  # XXX The stuff below is bogus in various ways...

If I may be so bold, I would like to see this URL cleaning code stripped
from urljoin.

A new method/function could be added that cleans a URL. It could have a
'mimic browser' option, because a browser *will* follow URLs like:
http://example.com/../../../ (see this non-bug
http://bugs.python.org/issue2583 )

The URL cleaner could use some of the code from posixpath. Shorter
URLs would be preferred over longer ones (e.g. http://example.com
preferred to http://example.com/ ).
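
As a rough illustration of the proposal above, here is a minimal sketch of
such a cleaner. The name clean_url and the mimic_browser flag are
hypothetical and not part of any existing module, and the sketch is written
for Python 3, where the urlparse module is named urllib.parse. It assumes an
absolute URL with a host, and it does not reuse posixpath; it walks the path
segments by hand so that the browser-like handling of '..' is explicit.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url(url, mimic_browser=True):
    """Hypothetical cleaner: collapse '.', '..' and repeated slashes.

    With mimic_browser=True, '..' segments that would climb above the
    root are dropped, as a browser does. With mimic_browser=False they
    are kept in place.
    """
    parts = urlsplit(url)
    segments = []
    for seg in parts.path.split('/'):
        if seg in ('', '.'):
            continue  # empty segments come from repeated slashes
        if seg == '..':
            if segments and segments[-1] != '..':
                segments.pop()  # step back one directory
            elif not mimic_browser:
                segments.append('..')  # keep the segment untouched
            continue
        segments.append(seg)

    # Prefer the shorter form: an empty path instead of a bare '/'.
    path = '/' + '/'.join(segments) if segments else ''
    # A path ending in '/', '/.' or '/..' names a directory.
    if segments and parts.path.endswith(('/', '/.', '/..')):
        path += '/'
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))


print(clean_url('http://example.com/../../../'))       # http://example.com
print(clean_url('http://www.example.com/a/../a/./'))   # http://www.example.com/a/
```

Keeping the flag separate from urljoin means callers who want the path
left alone can simply not call the cleaner.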

Thanks,

monk.e.boy

--
messages: 75154
nosy: monk.e.boy
severity: normal
status: open
title: urlparse normalize URL path
versions: Python 2.6

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue4191
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue2583] urlparse normalize URL path

2008-04-08 Thread monk.e.boy

New submission from monk.e.boy [EMAIL PROTECTED]:

Hi,
  This is my first problem with anything Python :-) and my first issue.

  Doing the following:

  urlparse.urljoin( 'http://site.com/', '../../../../path/' )
  'http://site.com/../../../../path/'

  urlparse.urljoin( 'http://site.com/', '/path/../path/.././path/./' )
  'http://site.com/path/../path/.././path/./'

These URLs are normalized to http://site.com/path/ in both Firefox and
Google (the Google spider would follow these OK).

  I think the documentation could be improved to point at the normpath
function in posixpath.py and explain how it solves the above. I blogged
a how-to:

http://teethgrinder.co.uk/blog/Normalize-URL-path-python/
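
The normpath approach mentioned above might look roughly like the sketch
below. This is not the code from the blog post; the helper name
normalize_url_path is hypothetical, and the sketch targets Python 3, where
urlparse lives in urllib.parse. It assumes an absolute URL with a host.
Two quirks of posixpath.normpath have to be worked around: it strips the
trailing slash, and it preserves exactly two leading slashes.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def normalize_url_path(url):
    """Hypothetical helper: normalize the path of an absolute URL."""
    parts = urlsplit(url)
    path = parts.path
    if not path:
        return url  # no path, nothing to normalize

    # normpath drops a trailing slash, so remember whether the
    # original path named a directory.
    is_dir = path.endswith(('/', '/.', '/..'))

    # normpath keeps exactly two leading slashes (a POSIX rule),
    # so reduce any run of leading slashes to one first.
    cleaned = posixpath.normpath('/' + path.lstrip('/'))

    if is_dir and not cleaned.endswith('/'):
        cleaned += '/'
    return urlunsplit((parts.scheme, parts.netloc, cleaned,
                       parts.query, parts.fragment))


print(normalize_url_path('http://site.com/../../../../path/'))
# http://site.com/path/
print(normalize_url_path('http://site.com/path/../path/.././path/./'))
# http://site.com/path/
```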

I hope my bug report is OK. Thanks for all the code :-)

[EMAIL PROTECTED]

--
components: Library (Lib)
messages: 65162
nosy: monk.e.boy
severity: normal
status: open
title: urlparse normalize URL path
type: behavior
versions: Python 2.5

__
Tracker [EMAIL PROTECTED]
http://bugs.python.org/issue2583
__