[issue14811] decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")

2021-04-13 Thread Pablo Galindo Salgado


Pablo Galindo Salgado added the comment:

Ok, let's continue the discussion on https://bugs.python.org/issue38755

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14811] decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")

2021-04-13 Thread Eryk Sun


Eryk Sun added the comment:

> So that means we can close the issue, no?

This is a bug in 3.8 and 3.9, which need the fix to keep reading until "\n" is 
seen on the line. I arrived at this issue via bpo-38755 if you think it should 
be addressed there, but it's the same bug that's reported here.
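
A rough Python sketch of that fix, for illustration only: the real change 
would be in the C tokenizer, and read_full_line and its buf_size parameter are 
hypothetical names. The idea is to keep issuing bounded reads and joining the 
pieces until one of them ends with "\n" (or the file ends), and only then 
decode.

```python
import io


def read_full_line(stream, buf_size=512):
    """Read one line using bounded reads, looping until "\\n" or EOF."""
    chunks = []
    while True:
        # Like fgets(): each call returns at most buf_size - 1 bytes.
        chunk = stream.readline(buf_size - 1)
        if not chunk:
            break  # EOF
        chunks.append(chunk)
        if chunk.endswith(b"\n"):
            break  # the whole line has been seen
    return b"".join(chunks)


# Hypothetical source: one long line of 2-byte characters, then a short line.
long_line = ("s = '" + "é" * 1000 + "'\n").encode("utf-8")
stream = io.BytesIO(long_line + b"x = 1\n")

first = read_full_line(stream)
second = read_full_line(stream)

print(first == long_line)                    # True
print(first.decode("utf-8").endswith("'\n"))  # True: decodes cleanly
print(second)                                # b'x = 1\n'
```

Decoding only after the newline has been seen means a multibyte character can 
never be split by the buffer boundary.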




[issue14811] decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")

2021-04-13 Thread STINNER Victor


STINNER Victor added the comment:

With https://bugs.python.org/issue14811#msg160706 I get a SyntaxError on Python 
3.7, 3.8, 3.9 and 3.10.0a6. But I don't get an error on the master branch 
(Python 3.10.0a7+).

Eryk:
> The latest alpha release, 3.10a7, includes your rewrite of the tokenizer, and 
> in that case t33a.py no longer fails in Windows.

Oh ok, this issue was fixed by the following commit, which is part of the 
v3.10.0a7 release:

commit 261a452a1300eeeae1428ffd6e6623329c085e2c
Author: Pablo Galindo 
Date:   Sun Mar 28 23:48:05 2021 +0100

bpo-25643: Refactor the C tokenizer into smaller, logical units (GH-25050)

--
resolution:  -> duplicate
stage: needs patch -> resolved
status: open -> closed
superseder:  -> Python tokenizer rewriting




[issue14811] decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")

2021-04-13 Thread Pablo Galindo Salgado


Pablo Galindo Salgado added the comment:

> no longer fails in Windows.

So that means we can close the issue, no?




[issue14811] decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")

2021-04-13 Thread Eryk Sun


Eryk Sun added the comment:

> I don't get any error executing the t33a.py script

The second line in t33a.py is 1618 bytes. The standard I/O BUFSIZ in Linux is 
8192 bytes, but it's only 512 bytes in Windows. The latest alpha release, 
3.10a7, includes your rewrite of the tokenizer, and in that case t33a.py no 
longer fails in Windows.
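
To make those numbers concrete, here is a minimal Python sketch (not the 
tokenizer's actual C code) of what a single fixed-size, fgets()-style read does 
to a 1618-byte line of multibyte UTF-8 characters. The helpers read_truncated 
and is_valid_utf8 and the sample line are made up for illustration; t33a.py 
itself is not reproduced here.

```python
def read_truncated(line: bytes, buf_size: int) -> bytes:
    """Mimic a single fgets() call: keep at most buf_size - 1 bytes."""
    return line[: buf_size - 1]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical 1618-byte line: a comment filled with 2-byte characters.
line = b"# " + "é".encode("utf-8") * 807 + b"!\n"

windows_chunk = read_truncated(line, 512)   # BUFSIZ quoted above for Windows
linux_chunk = read_truncated(line, 8192)    # BUFSIZ quoted above for Linux

print(len(line))                     # 1618
print(is_valid_utf8(windows_chunk))  # False: cut in the middle of an "é"
print(linux_chunk == line)           # True: the whole line fits
```

With the smaller buffer the chunk ends on the lead byte of a 2-byte sequence, 
which is the kind of input that produces the "Non-UTF-8 code" error.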

--
nosy: +eryksun




[issue14811] decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")

2021-04-13 Thread Pablo Galindo Salgado


Pablo Galindo Salgado added the comment:

I don't get any error executing the t33a.py script




[issue14811] decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")

2021-04-13 Thread STINNER Victor


Change by STINNER Victor:


--
nosy: +BTaskaya, lys.nikolaou, pablogsal




[issue14811] decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")

2021-04-13 Thread Eryk Sun


Change by Eryk Sun:


--
versions: +Python 3.8, Python 3.9 -Python 2.7, Python 3.2, Python 3.3, Python 
3.4




[issue14811] decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")

2012-11-04 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
stage:  -> needs patch
versions: +Python 3.4

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14811
___



[issue14811] decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")

2012-08-01 Thread STINNER Victor

STINNER Victor added the comment:

> Are we going to fix this before 3.3? Any objections to Victor's patch?

detect_truncate.patch now raises an error if a line is longer than BUFSIZ, 
whereas Python supports lines longer than BUFSIZ bytes (it's just that the 
encoding cookie is ignored if line 1 or 2 is longer than BUFSIZ bytes). So 
my patch is not correct.




[issue14811] decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")

2012-07-19 Thread Hynek Schlawack

Hynek Schlawack h...@ox.cx added the comment:

Are we going to fix this before 3.3? Any objections to Victor's patch?




[issue14811] decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")

2012-05-16 Thread STINNER Victor

STINNER Victor victor.stin...@gmail.com added the comment:

> Function decoding_fgets (Parser/tokenizer.c) reads line in buffer
> of fixed size 8192 (line truncated to size 8191) and then fails
> because line is cut in the middle of a multibyte UTF-8 character.

It looks like BUFSIZ is much smaller than 8192 on Windows: it's maybe only 1024 
bytes.

Attached patch detects when a line is truncated (longer than the internal 
buffer).

A better solution is maybe to reallocate the buffer if the string is longer 
than the buffer (write a universal fgets which allocates the buffer while the 
line is read). Most functions parsing Python source code use a dynamic buffer. 
For example, importing a module now reads the whole file content before parsing 
it (see FileLoader.get_data() in Lib/importlib/_bootstrap.py).
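
As a small illustration of the whole-file approach (a sketch, not the loader's 
own code), the bytes below stand in for what something like 
FileLoader.get_data() would return. They are decoded in one step with 
importlib.util.decode_source(), so no fixed-size line buffer is involved. The 
sample source and the variable names are made up.

```python
import importlib.util

# Hypothetical module source whose second line is far longer than 8192 bytes.
body = "é" * 5000
data = ("# -*- coding: utf-8 -*-\n" + "s = '" + body + "'\n").encode("utf-8")

# Decode the complete byte string at once; the encoding cookie on line 1
# is honoured and no line is cut at a buffer boundary.
text = importlib.util.decode_source(data)
lines = text.splitlines()

print(len(data) > 8192)                   # True
print(lines[1] == "s = '" + body + "'")   # True: the long line is intact
```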

At least, we should use a longer buffer on Windows (ex: use 8192 on all 
platforms?).

I only found two functions parsing a Python file line by line: 
PyRun_InteractiveOneFlags() and PyRun_FileExFlags(). There are many variants of 
these functions (ex: PyRun_InteractiveOne and PyRun_File). These functions are 
part of the C Python API and used by programs to execute Python code when 
Python is embedded in a program.

PS: As noticed by Serhiy Storchaka, the bug is not specific to Windows. It's 
just that the internal buffer is much smaller on Windows.

--
components: +Interpreter Core -Windows
keywords: +patch
nosy: +haypo
title: Syntax error on long UTF-8 lines -> decoding_fgets() truncates long 
lines and fails with a SyntaxError("Non-UTF-8 code starting with...")
Added file: http://bugs.python.org/file25605/detect_truncate.patch
