[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-13 Thread STINNER Victor


STINNER Victor  added the comment:

> Yes, that slows down Python 3.7.0a3 to the 3.7.0a4 level.

Ok, so something calls setlocale(LC_CTYPE, "") or setlocale(LC_ALL, "") in 
Python 3.7.0. I'm not interested in digging through the Git history. It doesn't 
really matter at this point.

Can you try to get the current LC_CTYPE locale on Python 3.6 or 3.7.0a3? 
Example:

$ python3 -c 'import _locale; print(_locale.setlocale(_locale.LC_CTYPE, None))'
fr_FR.utf8

Or using the locale module (it should give the same result):

$ python3 -c 'import locale; print(locale.setlocale(locale.LC_CTYPE, None))'
fr_FR.utf8

Can you also try, on 3.7.0a4 and newer (e.g. Python 3.7.1), calling 
locale.setlocale(locale.LC_ALL, "C") to see whether it works around your 
performance issue? I don't recall if the "C" or "POSIX" locales are supported 
on Windows.
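The two requests above can be combined into one small script. This is only a sketch using the stdlib locale module; the helper name current_ctype_locale is made up for illustration, and the benchmark itself would still have to run after the setlocale() call.

```python
import locale


def current_ctype_locale():
    """Return the LC_CTYPE locale without changing it (None means "query")."""
    return locale.setlocale(locale.LC_CTYPE, None)


before = current_ctype_locale()
print("LC_CTYPE before:", before)

# Force the "C" locale for every category, as suggested above.
locale.setlocale(locale.LC_ALL, "C")
after = current_ctype_locale()
print("LC_CTYPE after: ", after)
```

Run it once on 3.6 or 3.7.0a3 and once on 3.7.0a4+ to compare the "before" value, then start the benchmark in the same process.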

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-12 Thread Christoph Gohlke


Christoph Gohlke  added the comment:

> test_isdigit.c: Can you try to call locale.setlocale(locale.LC_CTYPE, "") 
> before running your benchmark on Python 3.7.0?

Yes, that slows down Python 3.7.0a3 to the 3.7.0a4 level.

--




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-12 Thread STINNER Victor


STINNER Victor  added the comment:

> This issue may be related to bpo-34485.

I'm thinking of:

New changeset 177d921c8c03d30daa32994362023f777624b10d by Victor Stinner in 
branch 'master':
bpo-34485, Windows: LC_CTYPE set to user preference (GH-8988)
https://github.com/python/cpython/commit/177d921c8c03d30daa32994362023f777624b10d

Oh, I only made this change in the future Python 3.8 (master branch), so this 
change may be unrelated.

Note: Right now, my Windows VM is broken, so I cannot investigate this 
performance issue, which seems to be specific to msvcrt (the libc of Microsoft 
Visual Studio).

--




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-12 Thread STINNER Victor


STINNER Victor  added the comment:

test_isdigit.c: Can you try to call locale.setlocale(locale.LC_CTYPE, "") 
before running your benchmark on Python 3.7.0?

This issue may be related to bpo-34485.

--




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-12 Thread Christoph Gohlke


Christoph Gohlke  added the comment:

I attached a minimal C extension module that can be used to demonstrate the 
performance degradation from Python 3.7.0a3 to 3.7.0a4.

Build the extension with `py setup.py build_ext --inplace`, then run the 
following code on both Python 3.7.0a3 and 3.7.0a4:

```
import time
from test_isdigit import test_isdigit

start_time = time.time()
test_isdigit()
print(time.time() - start_time)
```

On my Windows 10 Pro WS system, the timings are:

Python 3.7.0a3: ~0.0156
Python 3.7.0a4: ~0.3281


I would expect that other locale aware functions in the UCRT are also affected 
but I have not tested that.

--




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-12 Thread STINNER Victor


STINNER Victor  added the comment:

> digits = ''.join([str(i) for i in range(10)]*1000)
> %timeit digits.isdigit() # --> 2X+ slower on python 3.7.1

This code calls:

* (Python) str.isdigit()
* unicode_isdigit_impl()
* _PyUnicode_IsDigit()
* _PyUnicode_ToDigit() which uses Python internal Unicode database

This code doesn't depend on locales at all. It's pure Unicode.
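A quick way to check this claim with the stdlib only is to time str.isdigit() before and after switching LC_CTYPE to the user locale. This is a sketch: the repeat count and the helper name time_isdigit are arbitrary choices, and the Arabic-Indic digit is included only to show that the check is driven by the Unicode database.

```python
import locale
import timeit

digits = "".join([str(i) for i in range(10)] * 1000)


def time_isdigit(number=2000):
    """Time `number` calls of str.isdigit() on the 10,000-character string."""
    return timeit.timeit(digits.isdigit, number=number)


t_before = time_isdigit()
try:
    # Switch LC_CTYPE to the user's preferred locale.
    locale.setlocale(locale.LC_CTYPE, "")
except locale.Error:
    pass  # the environment names a locale that is not installed
t_after = time_isdigit()

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, a non-ASCII digit
print("before: %.4fs  after: %.4fs" % (t_before, t_after))
print(digits.isdigit(), arabic_three.isdigit())
```

If the two timings stay close on an affected Windows build, the slowdown would have to come from C code calling the C runtime, not from str.isdigit().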

--




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-12 Thread Dragoljub


Dragoljub  added the comment:

Here is a simple pure python example:

digits = ''.join([str(i) for i in range(10)]*1000)
%timeit digits.isdigit() # --> 2X+ slower on python 3.7.1

Basically, in the Pandas C parser we call the isdigit() function for each 
character of a number that is to be parsed, so 12345.6789 calls isdigit() 9 
times to determine whether each character is a digit that can be converted to a 
float. The problem is that in the latest version of Python, with the locale 
updates, isdigit() takes a locale argument that seems to be passed over and 
over, slowing down this check. Is it possible to disable any locale passing 
from Python down to lower-level C code, or simply set the default locale to 'C' 
to keep it from thrashing?
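For illustration, here is the per-character check written with a plain ASCII range test, which needs no locale at all. It is a Python sketch of the idea, not the parser's C code, and the helper names is_ascii_digit and count_digit_checks are hypothetical.

```python
def is_ascii_digit(ch):
    """Locale-independent digit test: true only for '0' through '9'."""
    return "0" <= ch <= "9"


def count_digit_checks(token):
    """Count the characters of `token` that pass the ASCII digit test."""
    return sum(1 for ch in token if is_ascii_digit(ch))


token = "12345.6789"
digit_count = count_digit_checks(token)
print(token, "contains", digit_count, "digit characters")
```

A range test like this never consults the thread's locale state, which is the part that looks expensive here.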

--




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-12 Thread Christoph Gohlke


Change by Christoph Gohlke :


Added file: https://bugs.python.org/file47929/setup.py




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-12 Thread Christoph Gohlke


Change by Christoph Gohlke :


Added file: https://bugs.python.org/file47928/test_isdigit.c




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-12 Thread Christoph Gohlke


Christoph Gohlke  added the comment:

> Can someone please try to write an example which only uses the stdlib?

The simplest is to compare performance of the 
`windll.LoadLibrary('API-MS-WIN-CRT-STRING-L1-1-0.DLL')` function on Python 
3.7.0a3 and 3.7.0a4, but that will mostly measure Python/ctypes overhead. I 
will post a minimal C extension instead.


> What are these extensions? Where do they come from?

The `isdigit` function is from the UCRT. The `parsers` Cython/C extension is 
part of the pandas wheel on PyPI. The context for this issue is at 
https://github.com/pandas-dev/pandas/issues/23516


> I don't understand which "locale changes" you are talking about. You can 
> change the locale using locale.setlocale().

The `UCRT.isdigit` function, when run on Python >=3.7.0a4, calls the 
`_isdigit_l` function, which calls `_LocaleUpdate::_LocaleUpdate` (see the VS 
profiler output).
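A stdlib-only probe along the lines mentioned above could look like the following sketch. It assumes that ucrtbase.dll exports isdigit on Windows and falls back to the C library already loaded in the process elsewhere. As noted, the timings mostly reflect ctypes call overhead, so only the difference between the two numbers is of interest.

```python
import ctypes
import locale
import sys
import timeit

# Assumption: ucrtbase.dll exports isdigit. On other platforms, use the
# C library that is already loaded into the current process.
if sys.platform == "win32":
    crt = ctypes.CDLL("ucrtbase")
else:
    crt = ctypes.CDLL(None)

crt.isdigit.argtypes = [ctypes.c_int]
crt.isdigit.restype = ctypes.c_int


def bench(number=100000):
    """Time `number` calls to the C runtime's isdigit('5') through ctypes."""
    return timeit.timeit(lambda: crt.isdigit(ord("5")), number=number)


t_startup = bench()
try:
    locale.setlocale(locale.LC_CTYPE, "")
except locale.Error:
    pass  # the environment names a locale that is not installed
t_user = bench()
print("startup locale: %.4fs  user locale: %.4fs" % (t_startup, t_user))
```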

--
nosy: +cgohlke




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-12 Thread STINNER Victor


STINNER Victor  added the comment:

Can someone please try to write an example which only uses the stdlib?

> The culprit is the isdigit function called in the parsers extension module.

What are these extensions? Where do they come from?

> Any way you can help test out a config setting to avoid the locale changes on 
> Python 3.7.0a4+?

(I fixed 2.7 => 3.7)

I don't understand which "locale changes" you are talking about. You can change 
the locale using locale.setlocale().

--




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-12 Thread Dragoljub


Dragoljub  added the comment:

@Vstinner,

Any way you can help test out a config setting to avoid the locale changes on 
Python 2.7.0a4+? They currently cause the low-level isdigit() function to call 
the locale-specific function on Windows and update the locale on each call, 
slowing down CSV parsing on Windows by 3.5X.

How can we configure Python so that it does not differ from 3.6.7 when it comes 
to locale behavior?

--




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-10 Thread Dragoljub


Dragoljub  added the comment:

@cgohlke compared the statement df2 = pd.read_csv(csv) on Python 3.7.0a3 and 
3.7.0a4 in the Visual Studio profiler. The culprit is the isdigit function 
called in the parsers extension module. On 3.7.0a3 the function is fast, at ~8% 
of samples. On 3.7.0a4 the function is slow, at ~64% of samples, because it 
calls the _isdigit_l function, which seems to update and restore the locale in 
the current thread every time...

(Incl. / Excl. = Inclusive / Exclusive Samples)

3.7.0a3:
  Function Name                                         Incl.    Excl.   Incl. %   Excl. %  Module Name
+ [parsers.cp37-win_amd64.pyd]                            705      347    28.52%    14.04%  parsers.cp37-win_amd64.pyd
  isdigit                                                 207      207     8.37%     8.37%  ucrtbase.dll
- _errno                                                  105       39     4.25%     1.58%  ucrtbase.dll
  toupper                                                  24       24     0.97%     0.97%  ucrtbase.dll
  isspace                                                  21       21     0.85%     0.85%  ucrtbase.dll
  [python37.dll]                                            1        1     0.04%     0.04%  python37.dll

3.7.0a4:
  Function Name                                         Incl.    Excl.   Incl. %   Excl. %  Module Name
+ [parsers.cp37-win_amd64.pyd]                          8,613      478    83.04%     4.61%  parsers.cp37-win_amd64.pyd
+ isdigit                                               6,642      208    64.04%     2.01%  ucrtbase.dll
+ _isdigit_l                                            6,434      245    62.03%     2.36%  ucrtbase.dll
+ _LocaleUpdate::_LocaleUpdate                          5,806      947    55.98%     9.13%  ucrtbase.dll
+ __acrt_getptd                                         2,121    1,031    20.45%     9.94%  ucrtbase.dll
  FlsGetValue                                             647      647     6.24%     6.24%  KernelBase.dll
- RtlSetLastWin32Error                                    296      235     2.85%     2.27%  ntdll.dll
  _guard_dispatch_icall_nop                               101      101     0.97%     0.97%  ucrtbase.dll
  GetLastError                                             46       46     0.44%     0.44%  KernelBase.dll
+ __acrt_update_multibyte_info                          1,475      246    14.22%     2.37%  ucrtbase.dll
- __crt_state_management::get_current_state_index       1,229      513    11.85%     4.95%  ucrtbase.dll
+ __acrt_update_locale_info                             1,263      235    12.18%     2.27%  ucrtbase.dll
- __crt_state_management::get_current_state_index       1,028      429     9.91%     4.14%  ucrtbase.dll
  _ischartype_l                                           383      383     3.69%     3.69%  ucrtbase.dll
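As a rough cross-check of the profiles, the total sample count of each run can be recovered from the parsers row (inclusive samples divided by inclusive share). The sketch below assumes both runs used the same sampling interval and the same workload, which the profiler output does not state.

```python
# Figures copied from the parsers.cp37-win_amd64.pyd rows of the two profiles.
a3_inclusive, a3_share = 705, 0.2852
a4_inclusive, a4_share = 8613, 0.8304

# Inclusive samples divided by inclusive share gives the run's total samples.
total_a3 = a3_inclusive / a3_share
total_a4 = a4_inclusive / a4_share

ratio = total_a4 / total_a3
print("total samples, 3.7.0a3: %.0f" % total_a3)
print("total samples, 3.7.0a4: %.0f" % total_a4)
print("3.7.0a4 / 3.7.0a3: %.2f" % ratio)
```

Under those assumptions the 3.7.0a4 run collected roughly four times as many samples, the same order of magnitude as the reported 3.5X slowdown.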

--




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-09 Thread Dragoljub


Dragoljub  added the comment:

I tested this at runtime with sys._enablelegacywindowsfsencoding()

Also this was new in 3.6 and Py 3.6 does not have the slowdown issue.

New in version 3.6: See PEP 529 for more details.
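For reference, a runtime test of this kind might look like the sketch below. The function exists only on Windows builds, so the call is guarded. The sketch only reports the filesystem encoding before and after the call and does not measure read_csv().

```python
import sys

print("before:", sys.getfilesystemencoding())

# sys._enablelegacywindowsfsencoding() is only available on Windows,
# so look it up defensively instead of calling it directly.
enable_legacy = getattr(sys, "_enablelegacywindowsfsencoding", None)
if enable_legacy is not None:
    enable_legacy()

fs_encoding = sys.getfilesystemencoding()
print("after: ", fs_encoding)
```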

--




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-09 Thread Karthikeyan Singaravelan


Karthikeyan Singaravelan  added the comment:

I have limited understanding of Windows and I don't have access to a Windows 
machine to check this out. I am adding Victor, who implemented the PEP and 
might be able to help here. There also seems to be 
PYTHONLEGACYWINDOWSFSENCODING for Windows-specific use cases. Some more notes 
on the original issue: https://bugs.python.org/issue29240#msg285278.

https://bugs.python.org/issue29240#msg285325

> Handle PYTHONLEGACYWINDOWSFSENCODING: this env var now disables the UTF-8 
> mode and has the priority over -X utf8 and PYTHONUTF8

--
nosy: +vstinner




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-09 Thread Dragoljub


Dragoljub  added the comment:

I tried playing around with the UTF-8 mode settings but did not get a speed 
improvement.

After reading through the PEP, it appears that on Windows:

"To allow for better cross-platform binary portability and to adjust 
automatically to future changes in locale availability, these checks will be 
implemented at runtime on all platforms other than Windows, rather than 
attempting to determine which locales to try at compile time."

So if I'm understanding this correctly, the locale coercion would not be 
controllable on Windows after Python is compiled?

--




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-09 Thread Karthikeyan Singaravelan


Karthikeyan Singaravelan  added the comment:

From PEP 540:

This mode is off by default, but is automatically activated when using the 
"POSIX" locale.

Add the -X utf8 command line option and PYTHONUTF8 environment variable to 
control UTF-8 Mode.

https://docs.python.org/3.7/using/cmdline.html#envvar-PYTHONUTF8

I think you can set it to 0, per the docs, to see if it has any effect.
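One way to confirm that the setting takes effect is to start a child interpreter with PYTHONUTF8 set and read back sys.flags.utf8_mode. This is a sketch for Python 3.7+; the helper name utf8_mode_with is made up for illustration.

```python
import os
import subprocess
import sys


def utf8_mode_with(value):
    """Run a child interpreter with PYTHONUTF8=value; return its utf8_mode flag."""
    env = dict(os.environ, PYTHONUTF8=value)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return int(result.stdout.strip())


print("current process:", sys.flags.utf8_mode)
print("PYTHONUTF8=0 ->", utf8_mode_with("0"))
print("PYTHONUTF8=1 ->", utf8_mode_with("1"))
```

The read_csv() benchmark can then be launched the same way, with PYTHONUTF8=0 in the child's environment.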

--




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-09 Thread Dragoljub


Dragoljub  added the comment:

After some more digging it appears that we see the 3.5x slowdown manifest in 
Python 3.7.0a4 and is not present in Python 3.7.0a3.

One guess is that 

https://docs.python.org/3.7/whatsnew/changelog.html#python-3-7-0-alpha-4

bpo-29240: Add a new UTF-8 mode: implementation of the PEP 540

may contribute to this slowdown on Windows. Is there a way to ensure we disable 
any native-to-UTF conversion that may be happening in Python 3.7.0a4?

--




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-09 Thread Dragoljub


Dragoljub  added the comment:

After some more benchmarks I'm seeing this line of code called in Python 3.7 
but not in Python 3.5:

{built-in method _thread.allocate_lock}
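The "{built-in method ...}" notation matches what cProfile prints, so a comparable listing could be collected roughly as follows. This is a sketch that assumes cProfile was the tool used; the workload is a stand-in string check, not the pandas call.

```python
import cProfile
import io
import pstats

digits = "".join(str(i) for i in range(10)) * 1000


def workload():
    """Stand-in workload; replace with the pd.read_csv() call under test."""
    for _ in range(200):
        digits.isdigit()


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Render the statistics into a string, sorted by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats()
report = buffer.getvalue()
print(report)
```

Comparing the reports from 3.5 and 3.7 line by line shows which built-in calls appear in only one of them.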

--




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-08 Thread Karthikeyan Singaravelan


Change by Karthikeyan Singaravelan :


--
nosy: +xtreak




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-08 Thread Dragoljub


New submission from Dragoljub :

xref: https://github.com/pandas-dev/pandas/issues/23516

Example:
import io
import numpy as np
import pandas as pd

# Build a small random DataFrame and round-trip it through an in-memory CSV.
df = pd.DataFrame(np.random.randn(100, 10),
                  columns=['COL{}'.format(i) for i in range(10)])
csv = io.StringIO(df.to_csv(index=False))
df2 = pd.read_csv(csv)  # 3.5X slower on Python 3.7.1

pd.read_csv() reads data at 30 MB/sec on Python 3.7.1, versus 100 MB/sec on 
Python 3.6.7.

This issue seems to be present only on Windows 10 builds, both x86 and x64. 

Possibly some IO changes in Python 3.7 contributed to this slowdown on Windows 
but not on Linux?

--
components: IO
messages: 329490
nosy: Dragoljub
priority: normal
severity: normal
status: open
title: Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 
On Windows 10
type: performance
versions: Python 3.7
