[issue26118] String performance issue using single quotes

2016-01-15 Thread Steven D'Aprano

Steven D'Aprano added the comment:

On Fri, Jan 15, 2016 at 07:56:39AM +, poostenr wrote:
> As I did more testing I noticed that appending data to the file slowed 
> down. The file grew initially with ~30-50KB increments and around 
> 500KB it had slowed down to ~3-5KB/s, until around 1MB the file grew 
> at ~1KB/s.

How are you appending to the file? In the code snippets below you merely 
say "append s to file", which is not Python code and could hide a 
multitude of sins. You're asking us to diagnose slow performance in code 
we can't see.

Or perhaps your disk is just badly fragmented, and as the file gets 
bigger, performance gets worse.

> Did conv.escape_string() change something about columnvalue

I don't know. What's conv.escape_string? Where does it come from, and 
what does it do? Again, you're asking us to diagnose code we can't see.

Perhaps you should run the profiler and see where your code is spending 
most of its time.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26118] String performance issue using single quotes

2016-01-15 Thread poostenr

poostenr added the comment:

Thank you for your feedback Victor and Steven.

I just copied my scripts and 360MB of CSV files over to Linux.
The entire process finished in 4 minutes exactly, using the original python 
scripts.
So there is something different between my environments.
If it was a fragmentation issue, then I would expect to always have a slow 
performance on the Windows system. But I can influence the performance by 
alternating between the two original statements:
s = "{0},".format(columnvalue)   # fast
s = "'{0}',".format(columnvalue) # ~30x slower

I apologize for not being able to provide the entire code.
There is too much code to post at this time.

I am opening a file like this:
#logger = open(filename, rw, buffering, encoding)
logger = open('output.sql', 'a', 1, 'iso-8859-1')

I write to file:
logger.write(text+'\n')

I'm using a library to escape the string before saving to file.
import pymysql.converters as conv
<...>
for key in listkeys:
keyvalue = self.recordstats[key]
fieldtype   = keyvalue[0]
columnvalue = record[key]
columnvalue = conv.escape_string(columnvalue)
if (count > 1):
s = "{0},".format(columnvalue)  # No single quotes
else
s = "{0},".format(columnvalue)  # No single quotes
count -= 1
logger.write(s+'\n')

I appreciate the feedback and ideas so far.
Trying the profiler is on my list to see if it provides more insight.
I am not using Anaconda3 on Linux. Perhaps that has an impact somehow?

I never suspected inserting the two single quotes to cause such a problem in 
performance. I noticed it when I parsed ~40GB of data and it took almost a week 
to complete instead of my expected 6-7 hrs.
Just the other day I decided to remove the single quotes because it was the 
only thing left that I'd changed. I had discarded that change the past two 
weeks because that couldn't be causing the performance problem.

Today, I wasn't expecting such a big difference between running my script on 
Linux or Windows.

If I discover anything else, I will post an update.
When I get the chance I can remove redundant code and post the source.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26118] String performance issue using single quotes

2016-01-15 Thread SilentGhost

SilentGhost added the comment:

poostenr, this is demonstrably not a problem with the CPython, which this bug 
tracker is about. There are few options available on the internet if you need 
help with your code: mailing lists and irc are among them.

--
nosy: +SilentGhost
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26118] String performance issue using single quotes

2016-01-15 Thread STINNER Victor

STINNER Victor added the comment:

I implemented overkill optimization in _PyUnicodeWriter API used by 
str.format() and str%args. If the result is the input, the string is not copied 
by value, but by reference.

>>> x="hello"
>>> ("%s" % x) is x
True
>>> ("{}".format(x)) is x
True

If the format string adds something before/after, the string is duplicated:

>>> ("x=%s" % x) is x
False
>>> ("x={}".format(x)) is x
False

The optimization is implemented in _PyUnicodeWriter_WriteStr():

https://hg.python.org/cpython/file/32a4e7b337c9/Objects/unicodeobject.c#l13604

--
nosy: +haypo

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26118] String performance issue using single quotes

2016-01-15 Thread STINNER Victor

STINNER Victor added the comment:

> If you see a factor of 30x difference in your code, I suspect it's not 
> related to str.format(), but some other processing in your code.

The performance of instructions like ("x=%s" % x) or ("x={}".format(x)) depend 
on the length of the string. Maybe poostenr has some very long strings?

--

Closed issues related to _PyUnicodeWriter API:

* issue #14716: Use unicode_writer API for str.format()
* issue #14687: Optimize str%tuple for the PEP 393
* issue #14744: Use _PyUnicodeWriter API in str.format() internals
* issue #16147: Rewrite PyUnicode_FromFormatV() to use the _PyUnicodeWriter API
* issue #17694: Enhance _PyUnicodeWriter API to control minimum buffer length 
without overallocation
* issue #19513: Use PyUnicodeWriter instead of PyAccu in repr(tuple) and 
repr(list)
* issue #19646: Use PyUnicodeWriter in repr(dict)

I'm listing these issues because they contain microbenchmarks script if you 
would like to play with them.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26118] String performance issue using single quotes

2016-01-14 Thread Steven D'Aprano

Steven D'Aprano added the comment:

I cannot replicate that performance difference under Linux. There's a small 
difference (about 0.1 second per million iterations, or a tenth of a 
microsecond) on my computer, but I don't think that's meaningful:

py> from timeit import Timer
py> t1 = Timer('"{0},".format(999)')
py> t2 = Timer('"\'{0}\',".format(999)')
py> min(t1.repeat())
4.671058462932706
py> min(t2.repeat())
4.774653751403093


Please re-run your tests using the timeit module, and see if you can still see 
a consistent difference with and without single quotes. Perhaps this is 
specific to Windows?

Otherwise, I can only suggest that the timing difference is unrelated to the 
difference in quotes in the script. Are you sure that this is absolutely the 
only change between the run that took one minute and the run that took 30 
minutes? No other changes to the script, running on the same data file, on the 
same disk, no difference in what other processes are running? (E.g. if one run 
is fighting for disk access with, say, a memory-hungry anti-virus scan, that 
would easily explain the difference.)

--
nosy: +steven.daprano

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26118] String performance issue using single quotes

2016-01-14 Thread poostenr

poostenr added the comment:

Eric, Steven, thank you for your feedback so far.

I am using Windows7, Intel i7.
That one particular file of 6.5MB took ~1 minute on my machine.
When I ran that same test on Linux with Python 3.5.1, it took about 3 seconds. 
I was amazed to see a 20x difference.

Steven suggested the idea that this phenomenon might be specific to Windows. 
And I agree, that is what it is looking like. Or is Python doing something in 
the background?

The Python script is straight forward with a loop that reads a line from a CSV 
file, split the column values and saves each value as '' to another 
file. Basically building an SQL statement.
I have had no issues until I added the encapsulating single quotes around the 
value.

Because I can reproduce this performance difference at will by alternating 
which line I comment out, leads me to believe it cannot be HDD, AV or something 
outside the python script interfering.

I repeated the simplified test, that I ran earlier on a Linux system, but this 
time on my Windows system.
I don't see anything spectacular.
I am just puzzled that using one statement or the other causes such a huge 
performance impact somehow.

I will try some more tests and copy your examples.

import time
loopcount = 1000

# Using string value
s="test 1"
v="test 1"
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
y = "{0}".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 1: 1452828394523
# End   test 1: 1452828397957
# Diff  test 1: 3434 ms


s="test 2"
v="test 2"
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
y = "'%s'" % (v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 2: 1452828397957
# End   test 2: 1452828401233
# Diff  test 2: 3276 ms


s="test 3"
v="test 3"
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
y = "'{0}'".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 3: 1452828401233
# End   test 3: 1452828406320
# Diff  test 3: 5087 ms

# Using integer value
s="test 4"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
y = "{0}".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 4: 1452828406320
# End   test 4: 1452828411378
# Diff  test 4: 5058 ms


s="test 5"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
y = "'%s'" % (v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 5: 1452828411378
# End   test 5: 1452828415264
# Diff  test 5: 3886 ms

s="test 6"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
y = "'{0}'".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 6: 1452828415264
# End   test 6: 1452828421292
# Diff  test 6: 6028 ms

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26118] String performance issue using single quotes

2016-01-14 Thread poostenr

poostenr added the comment:

Eric,

I just tried your examples. 
The loop count is 100x more, but the results are about a factor 10 off.

Test1:
My results:
C:\Data>python -m timeit -s 'x=4' '",{0}".format(x)'
1 loops, best of 3: 0.0116 usec per loop 

Eric's results:
$ python -m timeit -s 'x=4' '",{0}".format(x)'
100 loops, best of 3: 0.182 usec per loop

Test2:
My results:
C:\Data>python -m timeit -s 'x=4' '"\'{0}\',".format(x)'
1 loops, best of 3: 0.0122 usec per loop 

Eric's results:
$ python -m timeit -s 'x=4' '"'\''{0}'\'',".format(x)'
100 loops, best of 3: 0.205 usec per loop

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26118] String performance issue using single quotes

2016-01-14 Thread Eric V. Smith

Eric V. Smith added the comment:

I see a small difference, but I think it's related to the fact that in the 
first example you're concatenating 2 strings (',' and the result of {0}), and 
in the 2nd example it's 3 strings ("'", {0}, "',"):

$ echo '",{0}".format(x)'
",{0}".format(x)
$ python -m timeit -s 'x=4' '",{0}".format(x)'
100 loops, best of 3: 0.182 usec per loop

$ echo '"'\''{0}'\'',".format(x)'
"'{0}',".format(x)
$ python -m timeit -s 'x=4' '"'\''{0}'\'',".format(x)'
100 loops, best of 3: 0.205 usec per loop

If you see a factor of 30x difference in your code, I suspect it's not related 
to str.format(), but some other processing in your code.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26118] String performance issue using single quotes

2016-01-14 Thread Eric V. Smith

Eric V. Smith added the comment:

Please show us how you're measuring the performance. Also, please show the 
output of "python -V".

--
components: +Interpreter Core -Windows
nosy: +eric.smith -paul.moore, steve.dower, tim.golden, zach.ware

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26118] String performance issue using single quotes

2016-01-14 Thread ubehera

Changes by ubehera :


--
nosy: +ubehera

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26118] String performance issue using single quotes

2016-01-14 Thread poostenr

poostenr added the comment:

My initial observations with my Python script using:

s = "{0},".format(columnvalue)   # fast
Processed ~360MB of data from 2:16PM - 2:51PM (35 minutes, ~10MB/min)
One particular file 6.5MB took ~1 minute.

When I changed this line of code to:
s = "'{0}',".format(columnvalue) # ~30x slower (1 min. vs 30 min.)
Same particular file of 6.5MB took ~30 minutes (228KB/min).

My Python environment is:
C:\Data>python -V
Python 3.5.1 :: Anaconda 2.4.1 (32-bit)

I did some more testing with a very simplified piece of code, but is not 
conclusive. But there is a significant jump when I introduce the single quotes 
where I see a jump from 0m2.410s to 0m3.875s.

$ python -V
Python 3.5.1

// 
// s='test'
// for x in range(1000):
// y = "{0}".format(s)

$ time python test.py

real0m2.410s
user0m2.356s
sys 0m0.048s

// s='test'
// for x in range(1000):
// y = "'%s'" % (s)

$ time python test2.py 

real0m2.510s
user0m2.453s
sys 0m0.051s

// s='test'
// for x in range(1000):
// y = "'{0}'".format(s)

$ time python test3.py 

real0m3.875s
user0m3.819s
sys 0m0.048s

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26118] String performance issue using single quotes

2016-01-14 Thread poostenr

New submission from poostenr:

There appears to be a significant performance issue between the following two 
statements. Unable to explain performance impact.

s = "{0},".format(columnvalue)   # fast
s = "'{0}',".format(columnvalue) # ~30x slower

So far, no luck trying to find other statements to improve performance, such as:
s = "\'{0}\',".format(columnvalue)
s = "'" + "%s" %(columnvalue) + "'"+","
s = "{0}{1}{2},".format("'",columnvalue,"'")

--
components: Windows
messages: 258243
nosy: paul.moore, poostenr, steve.dower, tim.golden, zach.ware
priority: normal
severity: normal
status: open
title: String performance issue using single quotes
type: performance
versions: Python 3.5

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26118] String performance issue using single quotes

2016-01-14 Thread poostenr

poostenr added the comment:

Eric, Steven,

During further testing I was not able to find any real evidence that the 
statement I was focused on had a real performance issue.

As I did more testing I noticed that appending data to the file slowed down. 
The file grew initially with ~30-50KB increments and around 500KB it had slowed 
down to ~3-5KB/s, until around 1MB the file grew at ~1KB/s. I found this to be 
odd and because Steven had mentioned other processes, I started looking at some 
other statements.

After quite a lot of trial and error, I was able to use single quotes and 
increase my performance to acceptable levels.
Example 3 below is how I resolved it.

Can you explain to me why there was a performance penalty in example 2 ?
Python did something under the hood that I am overlooking.

Did conv.escape_string() change something about columnvalue, so that adding a 
single quote before and after it introduced some add behavior with writing to 
file ? I am not an expert on Python and remember reading something about 
Dynamic typing. 

Example 1: Fast performance, variable s is not encapsulated with single quotes
6.5MB parsed in ~1 minute.
for key in listkeys:
keyvalue = self.recordstats[key]
fieldtype   = keyvalue[0]
columnvalue = record[key]
columnvalue = conv.escape_string(columnvalue)
if (count > 1):
s = "{0},".format(columnvalue)  # No single quotes
else
s = "{0},".format(columnvalue)  # No single quotes
count -= 1
Append s to file.

Example 2: Slow performance, pre- and post-fixed variable s with single quotes
6.5MB parsed in 35 minutes.
for key in listkeys:
keyvalue = self.recordstats[key]
fieldtype   = keyvalue[0]
columnvalue = record[key]
columnvalue = conv.escape_string(columnvalue)
if (count > 1):
s = "'{0}',".format(columnvalue) # Added single quotes
else
s = "'{0}',".format(columnvalue) # Added single quotes
count -= 1
Append s to file.

Example 3: Fast performance, variable columnvalue is pre- and post-fixed with 
single quotes
6.5MB parsed in !45 seconds.
for key in listkeys:
keyvalue = self.recordstats[key]
fieldtype   = keyvalue[0]
columnvalue = record[key]
columnvalue = conv.escape_string("'" + columnvalue + "'") # Moved single 
quotes to this statement.
if (count > 1):
s = "{0},".format(columnvalue)
else
s = "{0},".format(columnvalue)
count -= 1
Append s to file.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com