[issue5445] codecs.StreamWriter.writelines problem when passed generator

2009-03-08 Thread Daniel Lescohier

New submission from Daniel Lescohier:

This is the implementation of codecs.StreamWriter.writelines for all 
Python versions that I've checked:

def writelines(self, list):

    """ Writes the concatenated list of strings to the stream
        using .write().
    """
    self.write(''.join(list))

This may be a problem if the 'list' parameter is a generator. The 
generator may yield millions of values, which join will concatenate in 
memory, which can surprise the programmer with large memory use. I think 
join should only be used when the size of the input is known, and this 
method cannot know it. I think the safe implementation of this method 
would be:

def writelines(self, list):

    """ Writes the concatenated list of strings to the stream
        using .write().
    """
    write = self.write
    for value in list:
        write(value)

If a caller knows that its input would use a reasonable amount of 
memory, it can get the same functionality as before by doing 
stream.write(''.join(list)).
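The difference between the two behaviors can be sketched with two
standalone helper functions (the names writelines_join and
writelines_iter are mine, for illustration only, not part of codecs):

```python
import io

def writelines_join(stream, lines):
    # Current behavior: ''.join materializes every item in memory
    # before a single write call.
    stream.write(''.join(lines))

def writelines_iter(stream, lines):
    # Proposed behavior: write each value as the iterator produces it,
    # so memory use stays bounded even for huge generators.
    write = stream.write
    for value in lines:
        write(value)

buf = io.StringIO()
# A generator works fine here; nothing is accumulated first.
writelines_iter(buf, (str(i) + '\n' for i in range(3)))
```

With writelines_join, the same generator would be fully consumed and
concatenated before anything reached the stream.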

--
components: Library (Lib)
message_count: 1.0
messages: 83322
nosy: dlesco
nosy_count: 1.0
severity: normal
status: open
title: codecs.StreamWriter.writelines problem when passed generator
versions: Python 2.5, Python 2.6, Python 2.7, Python 3.0

___
Python tracker 
<http://bugs.python.org/issue5445>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue5448] Add precision property to decimal.Decimal

2009-03-08 Thread Daniel Lescohier

New submission from Daniel Lescohier:

I would like to get the decimal precision of decimal.Decimal objects 
(for my application, in order to validate whether a Decimal value will 
fit in the defined decimal precision of a database field). The way I 
found to get it was with:

precision = len(value._int)

where value is a decimal.Decimal object.

However, this uses a private API (_int). I would like to have a public 
API to get the decimal precision of a Decimal.  I propose we add the 
following to the decimal.Decimal class:

@property
def precision(self):
    """The precision of this Decimal value."""
    return len(self._int)

decimal.Context has a precision for calculations. 
decimal.Decimal.precision is the minimum precision needed to represent 
that value, not the precision that was used in calculating it.  If one 
wants to, one can use Decimal.precision to set a Context's precision:

d1 = decimal.Decimal('999')
d2 = d1
context.prec = d1.precision + d2.precision
d3 = d1 * d2

Open for debate is whether to name it Decimal.prec to mirror 
Context.prec.  We'd have to choose one or the other or both:

@property
def precision(self):
    """The precision of this Decimal value."""
    return len(self._int)

prec = precision
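For reference, the same number can also be derived today through the
public as_tuple() method; the proposed property would mainly make it
more discoverable. A sketch (the helper name precision is mine, not part
of the stdlib):

```python
from decimal import Decimal

def precision(value):
    # Number of significant digits in the coefficient; equivalent to
    # len(value._int) but uses only the public as_tuple() API.
    return len(value.as_tuple().digits)

d1 = Decimal('999')
d2 = d1
# Size a Context so the product is never rounded:
# precision(d1) + precision(d2) digits always suffice for d1 * d2.
needed = precision(d1) + precision(d2)
```
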

--
components: Library (Lib)
message_count: 1.0
messages: 83328
nosy: dlesco
nosy_count: 1.0
severity: normal
status: open
title: Add precision property to decimal.Decimal
versions: Python 2.4, Python 2.5, Python 2.6, Python 2.7, Python 3.0, Python 3.1

___
Python tracker 
<http://bugs.python.org/issue5448>



[issue5448] Add precision property to decimal.Decimal

2009-03-09 Thread Daniel Lescohier

Daniel Lescohier added the comment:

I had other code to check scale, but you are right; I should use 
quantize. There is certainly a lot to absorb in the IBM decimal 
specification.  I really appreciate you pointing me to quantize and 
Inexact. I guess I inadvertently used the issue tracker for help on the 
decimal module; I didn't mean to do that, and I really thought there 
was a need for Decimal.precision. The other, unrelated issue I entered 
(#5445) should be more of a real issue.

My code constructs a conversion/validation closure for every field in 
the Schema, based on a SchemaField definition for each field. My 
SchemaFieldDecimal class includes precision and scale parameters, and 
now I'm going to add a rounding parameter, with None meaning raise an 
error on Inexact.

So pseudo-code for the fix:

scale = None if scale is None else Decimal((0, (1,), -scale))
traps = (InvalidOperation, Inexact) if rounding is None else (InvalidOperation,)
context = Context(prec=precision, rounding=rounding, traps=traps)

doing the conversion/validation:

For the case where scale is not None:

try:
    with context:
        value = Decimal(value).quantize(scale)
except handlers...

For the case where scale is None:

try:
    with context:
        value = Decimal(value) + Decimal(0)  # will round or raise Inexact
except handlers...
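Put together, the pseudo-code above might look like this as runnable
code (make_converter and its parameters are my illustrative framing of
the SchemaFieldDecimal idea, not an established API):

```python
from decimal import Decimal, Context, localcontext, Inexact, InvalidOperation

def make_converter(precision, scale=None, rounding=None):
    """Build a conversion/validation closure for one decimal field."""
    # rounding=None means: trap Inexact and raise instead of rounding.
    quantum = None if scale is None else Decimal((0, (1,), -scale))
    traps = [InvalidOperation, Inexact] if rounding is None else [InvalidOperation]
    context = Context(prec=precision, rounding=rounding, traps=traps)

    def convert(value):
        # localcontext installs a copy of our context for the duration.
        with localcontext(context):
            if quantum is not None:
                # Fixed scale: snap to the quantum, or raise Inexact.
                return Decimal(value).quantize(quantum)
            # No scale: adding 0 forces rounding to the context
            # precision (or raises Inexact when it is trapped).
            return Decimal(value) + Decimal(0)

    return convert
```

A caller would wrap convert() in try/except to handle Inexact and
InvalidOperation as validation errors.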

--
message_count: 2.0 -> 3.0

___
Python tracker 
<http://bugs.python.org/issue5448>



[issue5445] codecs.StreamWriter.writelines problem when passed generator

2009-03-10 Thread Daniel Lescohier

Daniel Lescohier added the comment:

In Python's file protocol, readlines and writelines are the protocol 
for iterating over a file. If one doesn't want to iterate over the 
file, one calls read() with no argument to read the whole file in, or 
calls write() with the complete contents one wants to write.

If writelines uses join, then when one passes an iterator to 
writelines, it will not iteratively write to the file: it will 
accumulate everything in memory until the iterator raises 
StopIteration, and only then write to the file.  So, if one is tailing 
the output file, one will see nothing in the file until the very end, 
instead of seeing content appear iteratively.  That breaks the promise 
of the file protocol that writelines means iteratively write.

I think following the protocol is more important than performance. If 
the application is having performance problems, it's up to the 
application to buffer the data in memory and make a single write call.

However, here is an alternative implementation that is slightly more 
complicated, but possibly has better performance for the passed-a-list 
case.  It covers three cases:

1. Passed an empty sequence: do not call self.write at all.
2. Passed a sequence with a length. That implies that all the data is 
available immediately, so one can concatenate and write with one 
self.write call.
3. Passed a sequence with no length. That implies that not all the data 
is available immediately, so write it iteratively.

def writelines(self, sequence):

    """ Writes the sequence of strings to the stream
        using .write().
    """
    try:
        sequence_len = len(sequence)
    except TypeError:
        write = self.write
        for value in sequence:
            write(value)
        return
    if sequence_len:
        self.write(''.join(sequence))

I'm not sure which is better.  But one last point is that Python is 
moving more in the direction of using iterators; e.g., in Py3K, 
replacing dict's keys, values, and items with the implementation of 
iterkeys, itervalues, and iteritems.
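The three cases can be exercised with a small stand-in class (IterWriter
is illustrative only, not the codecs API):

```python
import io

class IterWriter:
    """Illustrative stand-in for a writer using the writelines above."""

    def __init__(self, stream):
        self.stream = stream

    def write(self, data):
        self.stream.write(data)

    def writelines(self, sequence):
        try:
            sequence_len = len(sequence)
        except TypeError:
            # Case 3: no length, so data may arrive lazily; write iteratively.
            write = self.write
            for value in sequence:
                write(value)
            return
        if sequence_len:
            # Case 2: known length, so concatenate and write once.
            self.write(''.join(sequence))
        # Case 1: an empty sequence falls through with no write call.

w = IterWriter(io.StringIO())
w.writelines(['a', 'b'])         # sized sequence: one concatenated write
w.writelines(c for c in 'cd')    # generator: iterative writes
w.writelines([])                 # empty: no write at all
```
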

--
message_count: 3.0 -> 4.0

___
Python tracker 
<http://bugs.python.org/issue5445>




[issue5445] codecs.StreamWriter.writelines problem when passed generator

2009-03-10 Thread Daniel Lescohier

Daniel Lescohier added the comment:

Let me give an example of why it's important that writelines 
iteratively writes.  For:

rows = (line[:-1].split('\t') for line in in_file)
projected = (keep_fields(row, 0, 3, 7) for row in rows)
filtered = (row for row in projected if row[2]=='1')
out_file.writelines('\t'.join(row)+'\n' for row in filtered)

For a large input file, this works with a regular out_file object. With 
a codecs.StreamWriter-wrapped out_file object it won't, because that 
writelines does not follow the file protocol's expectation that 
writelines writes iteratively.

--
message_count: 4.0 -> 5.0

___
Python tracker 
<http://bugs.python.org/issue5445>



[issue5445] codecs.StreamWriter.writelines problem when passed generator

2009-03-10 Thread Daniel Lescohier

Daniel Lescohier added the comment:

OK, I think I see where I went wrong in my perception of the file 
protocol.  I thought that readlines() returned an iterator, not a list, 
but I see in the library reference manual on File Objects that it 
returns a list.  I think I got confused because there is no equivalent 
of __iter__ for writing to streams.  For input I always use 'for line 
in file_object' (in other words, file_object.__iter__), so I assumed 
writelines was the mirror image of that, since I never use the 
readlines method.  In my mind, readlines then became the mirror image 
of writelines, which I had assumed took an iterator, so I assumed 
readlines returned an iterator.  I wonder whether this perception 
problem is common.

So, the StreamWriter interface matches the file protocol; readlines() 
and writelines() deal with lists.  There shouldn't be any change to 
it, because it follows the protocol.

Then, the example I wrote would be instead:

rows = (line[:-1].split('\t') for line in in_file)
projected = (keep_fields(row, 0, 3, 7) for row in rows)
filtered = (row for row in projected if row[2]=='1')
formatted = (u'\t'.join(row)+'\n' for row in filtered)
write = out_file.write
for line in formatted:
    write(line)

I think it's correct that the file object's writelines C code only 
concatenates in 1000-line chunks for sequences that have a defined 
length: if it has a defined length, then the data exists now, and can 
be concatenated and written now.  Something without a defined length 
may be a generator whose items arrive later.

--
message_count: 6.0 -> 7.0

___
Python tracker 
<http://bugs.python.org/issue5445>