[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Richard Oudkerk added the comment:

On 32 bit Unix mmap() will raise ValueError("mmap length is too large") in Marc's example. This is correct since Python's sequence protocol does not support indexes larger than sys.maxsize.

But on 32 bit Windows, if length == 0 then the size check always passes, and the actual size mapped is the file size modulo 4GB.

Fix for 3.x is attached with tests.

--
keywords: +patch
stage: needs patch -> patch review
Added file: http://bugs.python.org/file28444/mmap.patch
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue16743
___
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Serhiy Storchaka added the comment:

This change is not backward compatible. Now a user can mmap a larger file and safely access the lower 2 GiB. With the patch it will fail.

The Unix implementation uses an unsafe integer overflow idiom which causes undefined behavior (Mark, you have the floor).

--
nosy: +mark.dickinson
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Richard Oudkerk added the comment:

> This change is not backward compatible. Now a user can mmap a larger file
> and safely access the lower 2 GiB. With the patch it will fail.

They should specify length=2GiB-1 if that is what they want.

With length=0 you can only access the lower 2GiB if file_size % 4GiB > 2GiB. If the file size is 4GiB+1 then you can only access *one byte* of the file. And if 2GiB < file_size < 4GiB then presumably len(data) will be negative (or throw an exception or fail an assertion -- I have not tested that case). I would not be surprised if crashes are possible.

Basically if you had a large file and you did not hit a problem then it was Windows specific dumb luck. I see no point in retaining such unpredictable behaviour.
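The arithmetic in the message above can be sketched in a few lines of Python. This is only a model of the truncation being described, not the C code of the mmap module; the helper names `truncated_length` and `as_signed_32` are made up for illustration, and the signed reinterpretation mirrors Richard's untested guess that the length could come out negative.

```python
GIB = 1 << 30

def truncated_length(file_size):
    """Model keeping only the low 32 bits of a 64-bit file size."""
    return file_size & 0xFFFFFFFF

def as_signed_32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return value - (1 << 32) if value >= (1 << 31) else value

# File sizes discussed in the thread: 5 GiB, 4 GiB + 1 byte, and one between 2 and 4 GiB.
for size in (5 * GIB, 4 * GIB + 1, 3 * GIB):
    low = truncated_length(size)
    print(size, low, as_signed_32(low))
```

Under this model a 5 GiB file yields a 1 GiB mapping, a file of 4 GiB + 1 byte yields a one-byte mapping, and a 3 GiB file yields a value that is negative when read as a signed 32-bit length.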
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Serhiy Storchaka added the comment:

Agree. Please add the same check for the Unix implementation (instead of the unsafe overflow trick).
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Richard Oudkerk added the comment:

New patch with same check for Unix.

--
Added file: http://bugs.python.org/file28446/mmap.patch
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Serhiy Storchaka added the comment:

LGTM.

Isn't 2 GiB + 1 bytes mmap file enough for testing?
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Richard Oudkerk added the comment:

> Isn't 2 GiB + 1 bytes mmap file enough for testing?

Yes. But creating multigigabyte files is very slow on Windows. On Linux/FreeBSD test_mmap takes a fraction of a second, whereas on Windows it takes over 2 minutes. (Presumably Linux/FreeBSD is automatically creating a sparse file.)

So adding assertions to an existing test is more convenient than creating another huge file just for these new tests.
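For readers unfamiliar with the sparse-file remark above, here is a rough sketch of creating a file of a given size without writing its contents. It is an assumption that this resembles what the test suite does; the helper name `make_sparse_file` is hypothetical, and whether the result is actually stored sparsely depends on the operating system and file system.

```python
import os
import tempfile

def make_sparse_file(path, size):
    """Create a file of `size` bytes by setting its length, without writing data."""
    with open(path, "wb") as f:
        f.truncate(size)

# A small size keeps the demonstration quick; the thread is about multi-gigabyte files.
path = os.path.join(tempfile.mkdtemp(), "big")
make_sparse_file(path, 1 << 20)
print(os.path.getsize(path))
os.remove(path)
```

On file systems that support holes this returns almost immediately regardless of `size`, which would be consistent with the timing difference Richard reports.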
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Terry J. Reedy added the comment:

Windows memory-maps multi-gigabyte files just fine as long as one uses the proper build (64-bit), which we provide.

Given that mmap produces a finite-length sequence object, as documented, slicing is working as it should. Slicing beyond the length returns an empty sequence. This is no different from 'abc'[4:6]==''.

Running Python with finite memory has many memory-associated limitations. They are mostly undocumented as the exact details may depend on hardware, OS, implementation, version, and build. One practical limitation is that mmap with a 32-bit build cannot completely map multi-gigabyte files.

The current doc says:

    class mmap.mmap(fileno, length, tagname=None, access=ACCESS_DEFAULT[, offset])
    (Windows version) Maps length bytes from the file specified by the file handle fileno, and creates a mmap object. If length is larger than the current size of the file, the file is extended to contain length bytes. If length is 0, the maximum length of the map is the current size of the file, except that if the file is empty Windows raises an exception (you cannot create an empty mapping on Windows).

It does not say what happens if the requested length is larger than the max possible on a particular system. In particular, there is no mention of exception raising. So failure to raise is not a bug for tracker purposes.

The two possibilities for what to do in such situations are best effort and bailout. The current choice (at least on Windows, and whether by us, Microsoft, or the original mmap authors, I don't know) is best effort. I think that is fine, but should be documented. Users who care can compare the mmap object length with the file length or needed length and raise or do whatever if the mmap length is too short.

So I think we should change this to a doc issue and add something like

    If the requested length is larger than the limit for the current system, then that limit is used as the length.

or

    The length of the returned mmap object has a limit that depends on the details of the running system.

Or the header should say that there is a system limit and two of the sentences above should be revised. In the first, change 'length bytes' to 'min(length, system limit) bytes'. (I am presuming this is true also when length is not given as 0.) In the last sentence, change 'current size' to 'min(current size, system limit)'.

The Unix version doc should also clarify behavior.

---
If we were to change mmap() (but only in a future release), then users who want the current behavior would have to discover, hard-code, and explicitly but conditionally pass the limit for each system their code might ever run on. I do not know that that is sensibly possible. I would not be surprised if the limit for a given 32-bit build varies for different Windows versions and setups.
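The suggestion that users who care can compare the mmap object length with the file length might look roughly like this in Python 3. The wrapper name `map_whole_file` is hypothetical and the choice of ValueError is an assumption; nothing like it exists in the mmap module.

```python
import mmap
import os
import tempfile

def map_whole_file(fileobj):
    """Map the whole file read-only; raise if the mapping is shorter than the file."""
    file_size = os.fstat(fileobj.fileno()).st_size
    m = mmap.mmap(fileobj.fileno(), 0, access=mmap.ACCESS_READ)
    if len(m) != file_size:
        m.close()
        raise ValueError("mapped %d bytes of a %d-byte file" % (len(m), file_size))
    return m

# Demonstration on a small temporary file, where both lengths agree.
with tempfile.TemporaryFile() as f:
    f.write(b"x" * 4096)
    f.flush()
    with map_whole_file(f) as m:
        print(len(m))
```

On a build that silently truncates the mapping, the length comparison would fail and the caller would get an exception instead of a short map.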
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Richard Oudkerk added the comment:

I suspect that the size of the 5GB file is originally a 64 bit quantity, but gets cast unsafely to a 32 bit size_t to give 1GB. This is causing the miscalculations.

There is no way to map all of a 5GB file in a 32 bit process -- 4GB is the maximum -- so any such attempt should raise an error. This does not prevent us from mapping *part* of a 5GB file.

--
nosy: +sbt
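Mapping only *part* of a file, as mentioned above, is done by passing an explicit length and offset. The following is a minimal sketch in Python 3; the helper `map_window` is hypothetical, and it rounds the offset down because mmap requires it to be a multiple of `mmap.ALLOCATIONGRANULARITY`.

```python
import mmap
import tempfile

def map_window(fileobj, start, length):
    """Map `length` bytes beginning at byte `start`, read-only.

    Returns the mmap object and the index inside it that corresponds to `start`.
    """
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = (start // gran) * gran   # offset must be a multiple of the granularity
    delta = start - aligned            # bytes mapped before `start` due to alignment
    m = mmap.mmap(fileobj.fileno(), delta + length,
                  access=mmap.ACCESS_READ, offset=aligned)
    return m, delta

# Demonstration on a small temporary file; the same call shape applies to large files.
with tempfile.TemporaryFile() as f:
    f.write(b"a" * 100000 + b"hello")
    f.flush()
    m, delta = map_window(f, 100000, 5)
    print(m[delta:delta + 5])
    m.close()
```

Only the requested window has to fit in the address space, so a 32-bit process can walk through a larger file one window at a time.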
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Richard Oudkerk added the comment:

This bit looks wrong to me:

    if (offset - size > PY_SSIZE_T_MAX)
        /* Map area too large to fit in memory */
        m_obj->size = (Py_ssize_t) -1;

Should it not be "size - offset" instead of "offset - size"? (offset and size are Py_LONG_LONG.) And there is no check that offset is non-negative.
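To make the two points above concrete, here is a Python model of what a corrected check could compute for a length=0 mapping. It is a sketch under assumptions, not the attached patch: the function `mapped_length` is hypothetical, `sys.maxsize` stands in for PY_SSIZE_T_MAX, and the error messages other than "mmap length is too large" are invented.

```python
import sys

def mapped_length(file_size, offset):
    """Return how many bytes a length=0 mapping would cover, or raise ValueError."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if offset > file_size:
        raise ValueError("offset is beyond the end of the file")
    remaining = file_size - offset      # size - offset, not offset - size
    if remaining > sys.maxsize:
        raise ValueError("mmap length is too large")
    return remaining

print(mapped_length(10, 4))
```

Because Python integers do not overflow, the subtraction here is safe by construction; in C the same comparison has to be arranged so that no intermediate value can overflow.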
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Antoine Pitrou added the comment:

Terry, what makes you think this is a feature request? This is a bug, quite simply.

--
nosy: +pitrou
versions: +Python 2.7, Python 3.2, Python 3.3
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Terry J. Reedy added the comment:

It is a report of behavior that lacks a specific request for change (that I can see). The implied code-change request could break working code. We don't usually do that. What do you think should be done?
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Serhiy Storchaka added the comment:

As I understand, the issue is that mmap slicing returns an empty string for large (but less than the ssize_t limit) indices on 2.7. Maybe it relates to the 30-bit digits long integer implementation?

--
nosy: +serhiy.storchaka
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Terry J. Reedy added the comment:

To me, Marc's title and penultimate sentence imply that he thinks that mmap should not accept such files. (But he should speak for himself.) As I said, not accepting such files could break working code.

As for the alternative of 'fixing' methods: Is it only slicing or other methods, even *every* method that 'misbehaves' when attempting to read (or write) beyond the 1 gig limit? I am guessing the last. If so, just about every method (inherited from bytearray, like slicing, or mmap specific) would need a fix conditional on the build and access location (and OS or hardware?).

Even for slices, what change would you (or anyone) make? Keep in mind that it is a *feature* of slices that they generally always work, and that this is specifically true of bytearrays. (Memory-mapped file objects behave like both bytearray and like file objects.)

I am actually a bit surprised that the limit is 1 gb rather than 2, 3, or 4 gb. Is it the same on *nix? What is the limit for a bytearray on Win 7?
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Serhiy Storchaka added the comment:

I have only a 32-bit OS and can't answer these questions. I'm surprised by the 1 GiB limit too.

Marc, can you please check a 4.5 GiB file? What is the limit in this case, 1 GiB or 0.5 GiB? What about slicing a big bytes object or bytearray (if you have enough memory)?

If mmap on Windows can't work with files larger than 4 GiB, then it should raise an exception on creation or at least on access. Silent production of a wrong result is not the way.
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
New submission from Marc Schlaich:

Platform: Windows 7 64 bit
Interpreter: Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32

Here are the steps to reproduce:

1. Create a big file (5 GB):

    with open('big', 'wb') as fobj:
        for _ in xrange(1024 * 1024 * 5):
            fobj.write('1' + '0' * 1023)

2. Open and process it with `mmap`:

    import mmap
    import re
    import sys

    with open('big', 'rb') as fobj:
        data = mmap.mmap(fobj.fileno(), 0, access=mmap.ACCESS_READ)
        print data.size()
        try:
            counter = 0
            for match in re.finditer('1' + '0' * 1023, data):
                counter += 1
            print len(data[1073740800:1073741824])  # (1 GB - 1024, 1 GB)
            print len(data[1073741824:1073742848])  # (1 GB, 1 GB + 1024)
        finally:
            data.close()
        print counter

This returns the following lines:

    5368709120
    1024
    0
    1048576

So this is a behavioral issue. `mmap` accepts a file which cannot fit in the interpreter memory but fits in the system memory. On processing the data, it only reads data until the maximum interpreter memory is reached (1 GB).

--
components: None
messages: 177879
nosy: schlamar
priority: normal
severity: normal
status: open
title: mmap accepts files > 1 GB, but processes only 1 GB
type: behavior
versions: Python 2.7
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Changes by Antoine Pitrou <pit...@free.fr>:

--
components: +Library (Lib), Windows -None
nosy: +brian.curtin, tim.golden
versions: +Python 3.2, Python 3.3, Python 3.4
[issue16743] mmap accepts files > 1 GB, but processes only 1 GB
Terry J. Reedy added the comment:

The immediate fix is to use a 64 bit build. That aside, what change in behavior are you suggesting? (and for 32 bit builds only?)

Should mmap.mmap warn if the file is longer than would be supported? This could be added to all current versions.

Should it raise in the same circumstance? What if a person *knows* that the file is 'too big' but only wants to access the first gigabyte? Forcing people to explicitly pass the magic number 1073741824 would, to me, effectively be a 3.4-at-best api change.

Perhaps mmap.mmap should be left alone and only the attempt to access beyond the cutoff should raise or warn. (Is the 32-bit cutoff OS specific?) Given that there are multiple access methods and methods that access, and that all accesses are ultimately delegated to the os mmap functions, this could be a major nuisance to get right.

Now that disks have grown to larger than a gigabyte, the doc should explicitly mention the memory space issue.

--
nosy: +terry.reedy
stage:  -> needs patch
type: behavior -> enhancement
versions: -Python 2.7, Python 3.2, Python 3.3