Re: [sphinx-users] Linkcheck UnicodeEncodeError

Chris Barker Wed, 23 Mar 2016 16:06:31 -0700

On Tue, Mar 22, 2016 at 3:50 AM, Bernhard Grotz <[email protected]>
wrote:


>
> today I got a problem when calling ``make linkcheck`` inside of a sphinx
> project. Apparently, there is a link inside the project which contains the
> German Umlaut 'Ü' ('\xdc'). When reaching this URL, the linkcheck-builder
> raises
> an Error and stops working:
>

isn't that correct behavior? As I understand it, URLs can only contain
ASCII characters, though it seems there is movement to extend that to
"IRIs":

https://en.wikipedia.org/wiki/Internationalized_resource_identifier

So the question is -- what is Sphinx's policy here? If it is only to support
the older, robust ASCII-only system, then an Error does make sense here.

>                 split[2] = quote_plus(split[2].encode('utf-8'), '/').decode('ascii')
>         AttributeError: 'str' object has no attribute 'decode'
>

then this looks like a py2-py3 error. The old py2 unicode object has a
decode() method, even though that actually makes no sense, but backwards
compatibility and all that... whatever .decode() is called on should
apparently be a bytes object at that point, not a string.
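
To make the py3 side of that concrete, here is a small sketch (the sample string is just an illustration, not the URL from the original report). Under py3, text can only be encoded and bytes can only be decoded:

```python
# Python 3: str has encode() but no decode(); bytes is the other way round.
text = "M\u00dcnchen"            # str containing U+00DC (the umlaut from the report)
raw = text.encode("utf-8")       # bytes

# Decoding the bytes gets the original text back.
round_trip = raw.decode("utf-8")

# This is the attribute the traceback complains about: it does not exist on str.
has_decode = hasattr(text, "decode")
print(has_decode)
```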

(and maybe there would be no Error then, as encode_uri maybe CAN encode
that non-ASCII character)

> .. code-block:: python
>
>         # handle non-ASCII URIs
>         try:
>                 req_url.encode('ascii')
>         except UnicodeError:
>                 req_url = encode_uri(req_url)
>
> Obviously, "UnicodeError" should be replaced with "UnicodeEncodeError"
> there.
>

if py2 supports that -- otherwise, UnicodeEncodeError appears to be a
subclass of UnicodeError, so this still gets caught.
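
A quick way to check that on py3 (is_ascii is a hypothetical helper name I made up for this, it is not in Sphinx):

```python
# UnicodeEncodeError derives from UnicodeError, so catching the parent
# class also catches the more specific encode failure.
assert issubclass(UnicodeEncodeError, UnicodeError)


def is_ascii(url):
    """Return True if url can be encoded as plain ASCII."""
    try:
        url.encode("ascii")
    except UnicodeError:  # also catches UnicodeEncodeError
        return False
    return True


print(is_ascii("http://example.com/"))
print(is_ascii("http://example.com/\u00dcbung"))
```

So narrowing the except clause to UnicodeEncodeError would be a bit more precise, but it should not change which URLs end up in encode_uri.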


> But the problem still keeps the same. Only changing
> ``req_url.encode('ascii')``
> to ``req_url.encode('utf-8')`` helps as a workaround, but then of course
> the
> checks of all URLs containing German Umlaute fail.
>
> Is there a better way to fix this problem?
>

I'm wondering why encode_uri isn't just called every time, anyway, but in
any case, it looks like it needs some py3 testing and fixing... Taking a
quick look:

def encode_uri(uri):
    split = list(urlsplit(uri))
    split[1] = split[1].encode('idna').decode('ascii')

so split[1] should be a unicode object now -- all good

    split[2] = quote_plus(split[2].encode('utf-8'), '/').decode('ascii')

Here is where it barfed:

split[2].encode('utf-8') is now a bytes object -- good.

but we're getting an error on the decode() call, so:

quote_plus() must be returning a unicode object.
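
That matches what the py3 standard library does. A small sketch, assuming the six import resolves to urllib.parse.quote_plus on py3 (I believe it does, but I have not checked the Sphinx source for this), with a made-up path:

```python
from urllib.parse import quote_plus

# Same call shape as in encode_uri: UTF-8 bytes in, '/' kept as a safe character.
path = "/\u00dcbung"
quoted = quote_plus(path.encode("utf-8"), "/")

# On Python 3 the result is already a str, which is why .decode() fails.
print(type(quoted).__name__, quoted)
```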

It looks like quote_plus() is coming from the six module:

urllib.quote(string[, safe])

Replace special characters in string using the %xx escape. Letters, digits,
and the characters '_.-' are never quoted. By default, this function is
intended for quoting the path section of the URL. The optional safe
parameter specifies additional characters that should not be quoted — its
default value is '/'.

Example: quote('/~connolly/') yields '/%7econnolly/'.

urllib.quote_plus(string[, safe])

Like quote(), but also replaces spaces by plus signs, as required for
quoting HTML form values when building up a query string to go into a URL.
Plus signs in the original string are escaped unless they are included in
safe. It also does not have safe default to '/'.
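
The text above is from the py2 docs. For comparison, here is how the py3 versions in urllib.parse behave with respect to the safe default (the sample string is made up):

```python
from urllib.parse import quote, quote_plus

sample = "/a b+c/"

# quote() keeps '/' by default and escapes the space as %20.
q = quote(sample)

# quote_plus() has no safe default, so '/' is escaped and the space becomes '+'.
qp = quote_plus(sample)

# Passing '/' as safe (as encode_uri does) keeps the slashes.
qp_safe = quote_plus(sample, "/")

print(q, qp, qp_safe)
```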

If it's doing its job with non-ASCII characters, then it should return an
ASCII-compatible string (i.e. a unicode str object in py3), so the .decode()
should not be required.
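
A minimal py3-only sketch of what that would look like. The name encode_uri_py3 is made up, and the final urlunsplit() is my assumption about how the function ends, since I only looked at the first few lines of the real one:

```python
from urllib.parse import quote_plus, urlsplit, urlunsplit


def encode_uri_py3(uri):
    """Py3-only sketch: turn a URI with non-ASCII parts into an ASCII one."""
    split = list(urlsplit(uri))
    # Host name: IDNA-encode, then back to str.
    split[1] = split[1].encode("idna").decode("ascii")
    # Path: percent-encode the UTF-8 bytes; quote_plus already returns str here,
    # so there is no trailing .decode('ascii').
    split[2] = quote_plus(split[2].encode("utf-8"), "/")
    # Assumption: reassemble the parts into a single string.
    return urlunsplit(split)


print(encode_uri_py3("http://example.com/\u00dcbung"))
```

This does not help a code base that still has to run on py2 as well, which is where the decode() presumably came from.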

of course, it may be there because it returns a py2 string, and we want a
py2 unicode object.

personally, I think it's odd that it wouldn't return a Unicode string under
py2, but if that's the case, then this may need the ugly:

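quoted = quote_plus(split[2].encode('utf-8'), '/')
try:
    # py2: quote_plus returns a byte string, so decode it to unicode
    split[2] = quoted.decode('ascii')
except AttributeError:
    # py3: quote_plus already returned a str, which has no decode()
    split[2] = quoted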

NOTE: I haven't tried to run any of this code.....

-CHB



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[email protected]
