[issue5445] codecs.StreamWriter.writelines problem when passed generator
New submission from Daniel Lescohier:

This is the implementation of codecs.StreamWriter.writelines for all Python versions that I've checked:

    def writelines(self, list):
        """ Writes the concatenated list of strings to the stream
            using .write().
        """
        self.write(''.join(list))

This may be a problem if the 'list' parameter is a generator. The generator may yield millions of values, which join will concatenate in memory. That can surprise the programmer with large memory use. I think join should only be used when you know the size of your input, and this method does not know it. I think the safe implementation of this method would be:

    def writelines(self, list):
        """ Writes the concatenated list of strings to the stream
            using .write().
        """
        write = self.write
        for value in list:
            write(value)

If a caller knows that its input list would use a reasonable amount of memory, it can get the same functionality as before by doing stream.write(''.join(list)).

----------
components: Library (Lib)
messages: 83322
nosy: dlesco
severity: normal
status: open
title: codecs.StreamWriter.writelines problem when passed generator
versions: Python 2.5, Python 2.6, Python 2.7, Python 3.0

___ Python tracker <http://bugs.python.org/issue5445> ___
___ Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
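To make the concern concrete, here is a minimal sketch contrasting the two strategies, using io.StringIO as a stand-in stream rather than the codecs machinery itself (the function names are illustrative, not part of any API):

```python
import io

def writelines_join(stream, lines):
    # Current StreamWriter behavior: the whole iterable is
    # materialized in memory before a single write() call.
    stream.write(''.join(lines))

def writelines_iter(stream, lines):
    # Proposed behavior: one write() per item, constant memory.
    write = stream.write
    for value in lines:
        write(value)

buf = io.StringIO()
writelines_iter(buf, ('line %d\n' % i for i in range(3)))
print(buf.getvalue())
```

Both produce identical output for small inputs; the difference only shows up as memory use when the generator yields a very large number of items.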
[issue5448] Add precision property to decimal.Decimal
New submission from Daniel Lescohier:

I would like to get the decimal precision of decimal.Decimal objects (for my application, in order to validate whether a Decimal value will fit in the defined decimal precision of a database field). The way I found to get it was with:

    precision = len(value._int)

where value is a decimal.Decimal object. However, this uses a private API (_int). I would like a public API to get the decimal precision of a Decimal. I propose we add the following to the decimal.Decimal class:

    @property
    def precision(self):
        """The precision of this Decimal value."""
        return len(self._int)

decimal.Context has a precision for calculations. decimal.Decimal.precision would be the minimum precision needed to represent that value, not the precision that was used in calculating it. If one wants to, one can actually use Decimal.precision to set a Context's precision:

    d1 = decimal.Decimal('999')
    d2 = d1
    context.prec = d1.precision + d2.precision
    d3 = d1 * d2

Open for debate is whether to name it Decimal.prec to mirror Context.prec. We'd have to choose one or the other, or both:

    @property
    def precision(self):
        """The precision of this Decimal value."""
        return len(self._int)

    prec = precision

----------
components: Library (Lib)
messages: 83328
nosy: dlesco
severity: normal
status: open
title: Add precision property to decimal.Decimal
versions: Python 2.4, Python 2.5, Python 2.6, Python 2.7, Python 3.0, Python 3.1

___ Python tracker <http://bugs.python.org/issue5448> ___
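For what it's worth, the coefficient length is also reachable through the public Decimal.as_tuple() API, which avoids the private _int attribute; a sketch of such a helper (the function name is illustrative):

```python
from decimal import Decimal

def precision(value):
    # Number of digits in the coefficient, obtained through the
    # public Decimal.as_tuple() API instead of the private _int.
    return len(value.as_tuple().digits)

print(precision(Decimal('999')))     # 3
print(precision(Decimal('0.0100')))  # 3: trailing zeros are stored digits
```

This gives the same result as len(value._int) without depending on the module's internals.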
[issue5448] Add precision property to decimal.Decimal
Daniel Lescohier added the comment:

I had other code to check scale, but you are right: I should use quantize. There is certainly a lot to absorb in the IBM decimal specification. I really appreciate you pointing me to quantize and Inexact. I guess I inadvertently used the issue tracker for help on the decimal module; I didn't mean to do that, I really thought there was a need for Decimal.precision. The other, unrelated issue I entered (#5445) should be more of a real issue.

My code constructs a conversion/validation closure for every field in the Schema, based on a SchemaField definition for each field. My SchemaFieldDecimal class includes precision and scale parameters, and now I'm going to add a rounding parameter, with None meaning raise an error on Inexact. Pseudo-code for the fix:

    scale = None if scale is None else Decimal((0, (1,), -scale))
    traps = (InvalidOperation, Inexact) if rounding is None else (InvalidOperation,)
    context = Context(prec=precision, rounding=rounding, traps=traps)

Doing the conversion/validation, for the case where scale is not None:

    try:
        with context:
            value = Decimal(value).quantize(scale)
    except handlers...

For the case where scale is None:

    try:
        with context:
            value = Decimal(value) + Decimal(0)  # will round or raise Inexact
    except handlers...

----------

___ Python tracker <http://bugs.python.org/issue5448> ___
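A runnable sketch of that closure, under stated assumptions: make_validator is an illustrative name, decimal.localcontext stands in for the `with context:` pseudo-code, and the exception handlers are left to the caller:

```python
import decimal
from decimal import Decimal, Context, Inexact, InvalidOperation

def make_validator(precision, scale=None, rounding=None):
    # Illustrative version of the closure described above.
    # rounding=None means "trap Inexact", i.e. reject any value
    # that would lose digits instead of silently rounding it.
    quantum = None if scale is None else Decimal((0, (1,), -scale))
    traps = [InvalidOperation] if rounding is not None else [InvalidOperation, Inexact]
    ctx = Context(prec=precision, rounding=rounding, traps=traps)

    def validate(value):
        with decimal.localcontext(ctx):
            d = Decimal(value)
            if quantum is not None:
                return d.quantize(quantum)   # fix the scale
            return d + Decimal(0)            # round to prec, or raise Inexact

    return validate

v = make_validator(precision=5, scale=2, rounding=decimal.ROUND_HALF_UP)
print(v('123.456'))  # Decimal('123.46')
```

With rounding=None the same input raises Inexact instead, which is the "reject instead of round" behavior the schema field wants.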
[issue5445] codecs.StreamWriter.writelines problem when passed generator
Daniel Lescohier added the comment:

In Python's file protocol, readlines and writelines are the protocol for iterating over a file. If one doesn't want to iterate over the file, one calls read() with no argument to read the whole file in, or calls write() with the complete contents one wants to write. If writelines uses join, then when one passes an iterator to writelines, it will not iteratively write to the file; it will accumulate everything in memory until the iterator raises StopIteration, and only then write to the file. So, if one is tailing the output file, one will not see anything in the file until the end, instead of seeing content appear iteratively. That breaks the promise of the file protocol's writelines to write iteratively.

I think following the protocol is more important than performance. If the application is having performance problems, it's up to the application to buffer the data in memory and make a single write call. However, here is an alternative implementation that is slightly more complicated, but possibly has better performance for the passed-a-list case. It covers three cases:

1. Passed an empty sequence: do not call self.write at all.
2. Passed a sequence with a length. That implies all the data is available immediately, so one can concatenate and write with one self.write call.
3. Passed a sequence with no length. That implies all the data is not available immediately, so iteratively write it.

    def writelines(self, sequence):
        """ Writes the sequence of strings to the stream
            using .write().
        """
        try:
            sequence_len = len(sequence)
        except TypeError:
            write = self.write
            for value in sequence:
                write(value)
            return
        if sequence_len:
            self.write(''.join(sequence))

I'm not sure which is better.
But one last point is that Python is moving more in the direction of using iterators; e.g., in Py3K, dict's keys, values, and items methods were replaced with iterator-style versions in the manner of iterkeys, itervalues, and iteritems.

----------

___ Python tracker <http://bugs.python.org/issue5445> ___
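The three-case implementation above can be exercised against a stand-in stream that counts write() calls, to show that sized sequences get a single write while unsized iterables are written item by item (CountingWriter and the free-function form of writelines are illustrative, not part of codecs):

```python
import io

class CountingWriter(io.StringIO):
    # Stand-in stream that records how many write() calls it receives.
    def __init__(self):
        super().__init__()
        self.calls = 0

    def write(self, s):
        self.calls += 1
        return super().write(s)

def writelines(stream, sequence):
    # The three-case strategy from the message above.
    try:
        sequence_len = len(sequence)
    except TypeError:
        # No length: data may arrive lazily, so write iteratively.
        write = stream.write
        for value in sequence:
            write(value)
        return
    if sequence_len:
        # Sized and non-empty: all data exists now, one write call.
        stream.write(''.join(sequence))

sized, unsized = CountingWriter(), CountingWriter()
writelines(sized, ['a\n', 'b\n', 'c\n'])
writelines(unsized, (s + '\n' for s in 'abc'))
print(sized.calls, unsized.calls)  # 1 3
```

Both streams end up with identical contents; only the number of write() calls, and therefore the buffering behavior, differs.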
[issue5445] codecs.StreamWriter.writelines problem when passed generator
Daniel Lescohier added the comment:

Let me give an example of why it's important that writelines iteratively writes. For:

    rows = (line[:-1].split('\t') for line in in_file)
    projected = (keep_fields(row, 0, 3, 7) for row in rows)
    filtered = (row for row in projected if row[2]=='1')
    out_file.writelines('\t'.join(row)+'\n' for row in filtered)

For a large input file, for a regular out_file object, this will work. For a codecs.StreamWriter-wrapped out_file object, this won't work, because it's not following the file protocol that writelines should iteratively write.

----------

___ Python tracker <http://bugs.python.org/issue5445> ___
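The pipeline runs as written once keep_fields is defined; a self-contained version with a hypothetical keep_fields helper and in-memory files standing in for the real ones:

```python
import io

def keep_fields(row, *indexes):
    # Hypothetical helper from the example: project selected columns.
    return [row[i] for i in indexes]

in_file = io.StringIO('a\tb\tc\td\te\tf\tg\th\n'
                      'p\tq\tr\ts\tt\tu\tv\t1\n')
out_file = io.StringIO()

# The generator pipeline from the message: split, project, filter, format.
rows = (line[:-1].split('\t') for line in in_file)
projected = (keep_fields(row, 0, 3, 7) for row in rows)
filtered = (row for row in projected if row[2] == '1')
out_file.writelines('\t'.join(row) + '\n' for row in filtered)

print(out_file.getvalue())
```

Only the second input row survives the filter (its projected third field is '1'), so each output line can be written as soon as its input line has been read.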
[issue5445] codecs.StreamWriter.writelines problem when passed generator
Daniel Lescohier added the comment:

OK, I think I see where I went wrong in my perceptions of the file protocol. I thought that readlines() returned an iterator, not a list, but I see in the library reference manual on File Objects that it returns a list. I think I got confused because there is no equivalent of __iter__ for writing to streams. For input, I'm always using 'for line in file_object' (in other words, file_object.__iter__), so I had assumed that writelines was the mirror image of that, because I never use the readlines method. Then, in my mind, readlines became the mirror image of writelines, which I had assumed took an iterator, so I assumed that readlines returned an iterator. I wonder if this perception problem is common or not.

So, the StreamWriter interface matches the file protocol; readlines() and writelines() deal with lists. There shouldn't be any change to it, because it follows the protocol. The example I wrote would instead be:

    rows = (line[:-1].split('\t') for line in in_file)
    projected = (keep_fields(row, 0, 3, 7) for row in rows)
    filtered = (row for row in projected if row[2]=='1')
    formatted = (u'\t'.join(row)+'\n' for row in filtered)
    write = out_file.write
    for line in formatted:
        write(line)

I think it's correct that the file object write C code only does 1000-line chunks for sequences that have a defined length: if it has a defined length, then that implies the data exists now and can be concatenated and written now. Something without a defined length may be a generator with items arriving later.

----------

___ Python tracker <http://bugs.python.org/issue5445> ___
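The chunking idea for sized sequences can also be approximated for unsized iterables; a sketch using itertools.islice as a middle ground between per-item writes and a whole-input join (write_chunked is an illustrative name, and the 1000-item default mirrors the chunk size mentioned above):

```python
import io
from itertools import islice

def write_chunked(stream, iterable, chunk_size=1000):
    # Batch an unsized iterable into fixed-size chunks, joining each
    # chunk so every write() call carries at most chunk_size items.
    # Memory use is bounded by chunk_size, and output still appears
    # incrementally rather than only after the iterator is exhausted.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        stream.write(''.join(chunk))

buf = io.StringIO()
write_chunked(buf, ('line %d\n' % i for i in range(5)), chunk_size=2)
print(buf.getvalue())
```

With chunk_size=2 the five generated lines arrive in three write() calls instead of five, while never holding more than two lines in memory at once.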