[issue17852] Built-in module _io can lose data from buffered files at exit

2018-01-24 Thread Nathaniel Smith

Nathaniel Smith  added the comment:

> I attempted to implement my weakref idea (i.e. raw file keeps a weakref to 
> the buffered file, calls flush before the raw file gets closed).  That 
> doesn't work either because the GC clears the weakref before calling __del__.

This may be a bit of a left-field or too-big-a-hammer suggestion, but as far I 
can tell from this thread [1] it probably is technically possible to modify the 
GC to clear weakrefs after calling __del__. Nick wasn't a fan (he likes the 
invariant that weakrefs can't trigger a resurrection), but if all else fails it 
might be worth re-raising.

You could add a secondary reference count on FileIO recording how many 
BufferedIO wrappers there are around it; then it's __del__ would skip calling 
close() if there are still BufferedIO wrappers, and BufferedIO.__del__ would 
decrement the reference count and close the underlying file if it hits zero and 
FileIO.__del__ had already been called.

[1] https://mail.python.org/pipermail/python-dev/2016-October/146747.html

--
nosy: +njs

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2018-01-24 Thread Neil Schemenauer

Neil Schemenauer  added the comment:

Using atexit is not the solution because the data can be lost even while the 
program is running, not just at interpreter shutdown.  The problem occurs if 
the buffered file object and the underlying file object are both part of a 
reference cycle.  Then, when the cycle is destroyed by the GC module, if the 
file __del__ is called before the buffer __del__, data is lost.

So far, the best solution I've come up with is as follows:  split the buffered 
file object into two objects, a wrapper object that will be returned from 
open() and an underlying object that holds the actual buffer.  Make the 
underlying file object keep references to all the buffers that are using the 
file.  When the file _del__ gets called, first flush all of the buffers and 
then close the file.  Splitting the buffered file object is required so that we 
don't create reference cycles.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2018-01-24 Thread Arusekk

Arusekk  added the comment:

Since the issue seems to have been active lately, may I suggest my view on 
solving it.

One solution that comes to my mind is to keep a weak reference to the file, and 
to register an atexit function per file (and to unregister it when the file is 
closed). Example concept-illustrating python code for the _pyio module:

import atexit, weakref

# ...

class TextIOWrapper(TextIOBase):
def __init__(self):
# ...
self._weakclose = operator.methodcaller('close')
atexit.register(self._weakclose, weakref.proxy(self))

# ...

def close(self):
atexit.unregister(self._weakclose)
# and now actually close the file

There is a possibility of a negative impact arising from the use of 
operator.methodcaller, because it may return something cached in future 
versions of python. There is also the issue of unregistering an atexit function 
during atexit processing. But the general idea seems to be simple and worth 
considering.

--
nosy: +Arusekk

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-20 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

Here's another idea: have a new type field (a tp_something) that exposes the GC 
priority for this type.  It could be an integer between -8 and 7, for example 
(0 being the default).  Then tp_finalize would be called in sorted priority 
order.

The tricky part of course is to make this acceptable performance-wise (i.e. 
avoid walking the collectable list 16 times, once for each priority level).  
For example by creating bucket lists for each priority level (only for objects 
with a non-NULL tp_finalize).

--
stage: patch review -> needs patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-19 Thread Neil Schemenauer

Neil Schemenauer  added the comment:

Yeah, I think you are correct.  Currently files not part of reference cycles 
get properly flushed and closed by the reference counter.  Implementing my 
"buffer_register_flush" patch would cause files to be closed only by the cyclic 
garbage collector (if not explicitly closed).  That would mean a script that 
opens a large number of files could run out of file descriptors.  Basically, we 
lose one of the main advantages of reference counting.

Probably the GC should keep track of how many files are open and call collect() 
when a threshold is reached.  Still, forcing every file to be collected only by 
the cyclic GC seems too ugly of a solution to this issue.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-19 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

I think it would be quite disruptive to create a reference cycle each time 
open() is called.  It may also break user scripts.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-19 Thread Neil Schemenauer

Neil Schemenauer  added the comment:

Welp, another day another attempt.  As mentioned in the PR 4847, atexit is not 
the answer.  If the raw/buffered file pair are part of a reference cycle and 
the GC cleans it before atexit runs, then the buffered data can get lost.

I attempted to implement my weakref idea (i.e. raw file keeps a weakref to the 
buffered file, calls flush before the raw file gets closed).  That doesn't work 
either because the GC clears the weakref before calling __del__.

The attached patch "buffer_register_flush.txt" does seem to work.  The downside 
is that it creates a reference cycle between the raw and buffered file objects. 
 Perhaps that is not a problem since unless you call close() on the raw file, 
you will be leaking resources anyhow.  In the patch, calling close() removes 
the reference cycle.

I still feel like this is worth fixing, as ugly as the implementation is.

--
assignee:  -> nascheme
Added file: https://bugs.python.org/file47340/buffer_register_flush.txt

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-15 Thread Nikolaus Rath

Nikolaus Rath  added the comment:

Given the apparent difficulties getting this right, how about not attempting to 
implicitly flush buffers in the finalizer at all? This means scripts relying on 
this will break, but in contrast to the current behavior they will break 
consistently so it's easy to find and fix the problem.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-15 Thread Nitish

Change by Nitish :


--
nosy: +nitishch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-15 Thread Neil Schemenauer

Neil Schemenauer  added the comment:

Using reversed chronological order would work in 99% of the cases probably but 
then you would have that rare case where it didn't work.  So, I'm not too keen 
on that approach.

I think this is a classic problem with finalization and GC, probably should do 
some actual reading of literature.  One idea that occurs to me is that the 
underlying file object could keep a weakref to the higher level (e.g. buffer) 
objects.  Then, if the underlying file gets closes, it could make sure to call 
flush on the higher level objects first.  The weakref would ensure that a ref 
cycle is not created.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-15 Thread Serhiy Storchaka

Serhiy Storchaka  added the comment:

I'm not well known with GC, but if it be possible to call destructors in the 
reversed chronological order, this could help with this issue.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-15 Thread Neil Schemenauer

Neil Schemenauer  added the comment:

In the process of trying to write a test for this, I now realize that PR 4847 
is not really a fix.  If the underlying file gets closed during an earlier 
gc.collect() and not during shutdown, the extra flushing step is not going to 
help.  So, using atexit is not really a fix.  Maybe there is something else we 
can do, need to think about it more.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-14 Thread Armin Rigo

Change by Armin Rigo :


--
nosy:  -arigo

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-14 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

The reason Python 2 did well here is simply that Python 2 had a single Python 
object (the file object) for each actual file.  Python 3 has several of them 
(the raw IO object, the buffered IO object, possibly the text IO wrapper), and 
so suddenly the finalization order matters.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-14 Thread Neil Schemenauer

Neil Schemenauer  added the comment:

Attached is a script that triggers the non-flushing behaviour for me.  I don't 
think it is reliable since it depends on the order that FileIO AND 
BufferedWriter are finalized when the gc finds them in a reference cycle.

BTW, it is arguable that the root cause of this bug is that we no longer to 
topological ordering when calling finalizers.  Originally, the cycle GC would 
refuse to call __del__ methods because once they are part of a cycle, there is 
no defined topological ordering.  So, we linked that garage to gc.garbage 
rather than calling __del__ and have potentially undefined results.  Now the GC 
has been changed to call __del__ anyhow.   I haven't studied how this has been 
changed but this non-flushing bug is a result.  Having the buffer added to 
gc.garbage would also result in the data not being flushed but arguably it 
would be more understandable what's going on.  I'm not arguing that we should 
go back to that, just that current behaviour can be subtle and confusing.

--
Added file: https://bugs.python.org/file47332/buffer_not_flushed.py

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-13 Thread Neil Schemenauer

Neil Schemenauer  added the comment:

I created a new PR which uses the atexit module instead of using _Py_PyAtExit.  
I think registering in io.py is okay.

I see that atexit is now implemented in C.  Rather than registering in io.py, 
we could create a C API to register callbacks (i.e. atexit_register).  That 
would perhaps be a bit cleaner.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-13 Thread Neil Schemenauer

Change by Neil Schemenauer :


--
pull_requests: +4735
stage: needs patch -> patch review

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-12 Thread STINNER Victor

STINNER Victor  added the comment:


New changeset 317def9fdb29893df1ab380d396fcdd2eafe0588 by Victor Stinner 
(Antoine Pitrou) in branch 'master':
bpo-17852: Revert incorrect fix based on misunderstanding of _Py_PyAtExit() 
semantics (#4826)
https://github.com/python/cpython/commit/317def9fdb29893df1ab380d396fcdd2eafe0588


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-12 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

I just discovered that the fix is incorrect. See PR #4826 for reversion and a 
quick explanation.

--
resolution: fixed -> 
stage: resolved -> needs patch
status: closed -> open

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-12-12 Thread Antoine Pitrou

Change by Antoine Pitrou :


--
pull_requests: +4718

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-11-06 Thread Neil Schemenauer

Neil Schemenauer  added the comment:

Yes, my bad.  I thought that accepting the pull would close the bug.

--
resolution:  -> fixed
stage: backport needed -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-11-06 Thread STINNER Victor

STINNER Victor  added the comment:

Neil pushed the commit 0a1ff24acfc15d8c7f2dc41000a6f3d9a31e7480. What is the 
status of this issue? Can we now close it?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-11-04 Thread Berker Peksag

Change by Berker Peksag :


--
stage: needs patch -> backport needed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-09-22 Thread Neil Schemenauer

Neil Schemenauer added the comment:


New changeset 0a1ff24acfc15d8c7f2dc41000a6f3d9a31e7480 by Neil Schemenauer in 
branch 'master':
bpo-17852: Maintain a list of BufferedWriter objects.  Flush them on exit. 
(#3372)
https://github.com/python/cpython/commit/0a1ff24acfc15d8c7f2dc41000a6f3d9a31e7480


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-09-05 Thread Neil Schemenauer

Changes by Neil Schemenauer :


--
pull_requests: +3382, 3383

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-09-05 Thread Neil Schemenauer

Changes by Neil Schemenauer :


--
pull_requests: +3382

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17852] Built-in module _io can lose data from buffered files at exit

2017-09-05 Thread Raymond Hettinger

Changes by Raymond Hettinger :


--
title: Built-in module _io can loose data from buffered files at exit -> 
Built-in module _io can lose data from buffered files at exit

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com