[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-26 Thread Richard Oudkerk

Richard Oudkerk added the comment:

On 32 bit Unix mmap() will raise ValueError("mmap length is too large") in 
Marc's example.  This is correct since Python's sequence protocol does not 
support indexes larger than sys.maxsize.
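
The sys.maxsize limit can be seen from pure Python.  The following is a 
minimal sketch, not part of the attached patch; TooLong and len_fits are 
hypothetical names used only for illustration.

```python
import sys


class TooLong:
    """Hypothetical object that claims a length one past sys.maxsize."""

    def __len__(self):
        return sys.maxsize + 1


def len_fits(obj):
    """Return True if len(obj) fits in an index-sized integer."""
    try:
        len(obj)
    except OverflowError:
        # CPython refuses lengths that do not fit in a Py_ssize_t.
        return False
    return True


# sys.maxsize is 2**31 - 1 on a 32 bit build and 2**63 - 1 on a 64 bit build.
print(sys.maxsize)
print(len_fits(TooLong()))
```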

But on 32 bit Windows, if length == 0 then the size check always passes, and 
the actual size mapped is the file size modulo 4GB.

Fix for 3.x is attached with tests.

--
keywords: +patch
stage: needs patch -> patch review
Added file: http://bugs.python.org/file28444/mmap.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue16743
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-26 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

This change is not backward compatible. Currently a user can mmap a larger file 
and safely access the lower 2 GiB. With the patch this will fail.

The Unix implementation uses an unsafe integer overflow idiom which causes 
undefined behavior (Mark, you have the floor).

--
nosy: +mark.dickinson




[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-26 Thread Richard Oudkerk

Richard Oudkerk added the comment:

> This change is not backward compatible. Currently a user can mmap a larger 
> file and safely access the lower 2 GiB. With the patch this will fail.

They should specify length=2GiB-1 if that is what they want.

With length=0 you can only access the lower 2GiB if file_size % 4GiB < 2GiB.  
If the file size is 4GiB+1 then you can only access *one byte* of the file.  
And if 2GiB < file_size < 4GiB then presumably len(data) will be negative (or 
throw an exception or fail an assertion -- I have not tested that case).  I 
would not be surprised if crashes are possible.
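
These cases can be modelled with plain integer arithmetic.  This is only a 
sketch: it assumes the 64 bit file size is truncated to its low 32 bits and 
then read as a signed 32 bit length, which is a guess at the behaviour 
rather than a trace of the actual C code.  apparent_length_32bit is a 
hypothetical name.

```python
GIB = 1 << 30


def apparent_length_32bit(file_size):
    """Model a 64 bit size truncated to 32 bits and read as a signed value."""
    low = file_size & 0xFFFFFFFF      # file size modulo 4 GiB
    if low >= 1 << 31:                # top bit set: negative when signed
        low -= 1 << 32
    return low


for size in (5 * GIB, 4 * GIB + 1, 3 * GIB):
    print(size, apparent_length_32bit(size))
```

Under that assumption a 5 GiB file appears as 1 GiB, a file of 4 GiB + 1 
bytes appears as a single byte, and a 3 GiB file appears with a negative 
length.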

Basically if you had a large file and you did not hit a problem then it was 
Windows specific dumb luck.  I see no point in retaining such unpredictable 
behaviour.

--




[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-26 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Agreed.

Please add the same check for the Unix implementation (instead of the unsafe 
overflow trick).

--




[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-26 Thread Richard Oudkerk

Richard Oudkerk added the comment:

New patch with same check for Unix.

--
Added file: http://bugs.python.org/file28446/mmap.patch




[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-26 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

LGTM. Isn't a file of 2 GiB + 1 bytes enough for testing?

--




[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-26 Thread Richard Oudkerk

Richard Oudkerk added the comment:

> Isn't a file of 2 GiB + 1 bytes enough for testing?

Yes.

But creating multigigabyte files is very slow on Windows.  On Linux/FreeBSD 
test_mmap takes a fraction of a second, whereas on Windows it takes over 2 
minutes.  (Presumably Linux/FreeBSD is automatically creating a sparse file.)

So adding assertions to an existing test is more convenient than creating 
another huge file just for these new tests.
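
For reference, a large test file is typically created by seeking past the 
end and writing a single byte, roughly as sketched below.  make_big_file is 
a hypothetical helper, not the one used in test_mmap, and whether the 
result is actually sparse depends on the OS and filesystem.

```python
import os
import tempfile


def make_big_file(path, size):
    """Create a file of `size` bytes by writing only its last byte."""
    with open(path, "wb") as f:
        f.seek(size - 1)   # skip over the (unwritten) body of the file
        f.write(b"\0")     # a single write sets the file size


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "big")
    # A small size keeps the illustration fast; the real tests need > 2 GiB.
    make_big_file(path, 16 * 1024 * 1024)
    print(os.path.getsize(path))
```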

--




[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-24 Thread Terry J. Reedy

Terry J. Reedy added the comment:

Windows memory-maps multi-gigabyte files just fine as long as one uses the 
proper build (64-bit), which we provide.

Given that mmap produces a finite-length sequence object, as documented, 
slicing is working as it should. Slicing beyond the length returns an empty 
sequence. This is no different from 'abc'[4:6]==''.
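
The slicing behaviour is easy to check on a small file.  This is a minimal 
sketch using a three-byte temporary file, written for Python 3 rather than 
the 2.7 of the original report.

```python
import mmap
import tempfile

with tempfile.TemporaryFile() as f:
    f.write(b"abc")
    f.flush()
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        in_range = m[0:2]   # slice inside the mapping
        past_end = m[4:6]   # slice entirely beyond the mapped length

print(in_range, past_end, "abc"[4:6] == "")
```

The slice past the end is empty rather than an error, exactly as for str, 
bytes and bytearray.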

Running Python with finite memory has many memory-associated limitations. They 
are mostly undocumented as the exact details may depend on hardware, OS, 
implementation, version, and build. One practical limitation is that mmap with 
a 32-bit build cannot completely map multi-gigabyte files.

The current doc says:
class mmap.mmap(fileno, length, tagname=None, access=ACCESS_DEFAULT[, offset]) 
(Windows version) Maps length bytes from the file specified by the file handle 
fileno, and creates a mmap object. If length is larger than the current size of 
the file, the file is extended to contain length bytes. If length is 0, the 
maximum length of the map is the current size of the file, except that if the 
file is empty Windows raises an exception (you cannot create an empty mapping 
on Windows).

It does not say what happens if the requested length is larger than the max 
possible on a particular system. In particular, there is no mention of 
exception raising. So failure to raise is not a bug for tracker purposes.

The two possibilities for what to do in such situations are best effort and 
bailout. The current choice (at least on Windows, and whether by us, Microsoft, 
or the original mmap authors, I don't know) is best effort. I think that is 
fine, but should be documented. Users who care can compare the mmap object 
length with the file length or needed length and raise or do whatever if the 
mmap length is too short.
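
Such a user-side check might look like the following sketch.  
mmap_whole_file is a hypothetical helper, not an existing API, and the 
choice of ValueError is only an assumption.

```python
import mmap
import os
import tempfile


def mmap_whole_file(fileno, access=mmap.ACCESS_READ):
    """Hypothetical helper: map the whole file, or raise if the map is short."""
    m = mmap.mmap(fileno, 0, access=access)
    mapped = len(m)                        # read before any close()
    file_size = os.fstat(fileno).st_size
    if mapped != file_size:
        m.close()
        raise ValueError("mapped only %d of %d bytes" % (mapped, file_size))
    return m


with tempfile.TemporaryFile() as f:
    f.write(b"x" * 4096)
    f.flush()
    m = mmap_whole_file(f.fileno())
    mapped_length = len(m)
    m.close()

print(mapped_length)
```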

So I think we should change this to a doc issue and add something like "If the 
requested length is larger than the limit for the current system, then that 
limit is used as the length."
or
"The length of the returned mmap object has a limit that depends on the details 
of the running system."

Or the header should say that there is a system limit, and two of the sentences 
above should be revised. In the first, change 'length bytes' to 'min(length, 
system limit) bytes'. (I am presuming this is true also when length is not 
given as 0.) In the last sentence, change 'current size' to 'min(current size, 
system limit)'.

The Unix version doc should also clarify behavior.
---

If we were to change mmap() (but only in a future release), then users who want 
the current behavior would have to discover, hard-code, and explicitly but 
conditionally pass the limit for each system their code might ever run on. I do 
not know that that is sensibly possible. I would not be surprised if the limit 
for a given 32-bit build varies for different windows versions and setups.

--




[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-24 Thread Richard Oudkerk

Richard Oudkerk added the comment:

I suspect that the size of the 5GB file is originally a 64 bit quantity, but 
gets cast unsafely to a 32 bit size_t to give 1GB.  This is causing the 
miscalculations.

There is no way to map all of a 5GB file in a 32 bit process -- 4GB is the 
maximum -- so any such attempt should raise an error.  This does not prevent us 
from mapping *part* of a 5GB file.
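
Mapping part of a file means passing an explicit length and offset.  The 
sketch below shows this on a small temporary file; map_window is a 
hypothetical helper, and the offset is assumed to be a multiple of 
mmap.ALLOCATIONGRANULARITY, as mmap requires.

```python
import mmap
import tempfile


def map_window(fileno, offset, length):
    """Map `length` bytes of the file starting at `offset` (read-only)."""
    return mmap.mmap(fileno, length, access=mmap.ACCESS_READ, offset=offset)


gran = mmap.ALLOCATIONGRANULARITY
with tempfile.TemporaryFile() as f:
    # Two blocks: the first filled with "a", the second with "b".
    f.write(b"a" * gran + b"b" * gran)
    f.flush()
    with map_window(f.fileno(), gran, 16) as window:
        data = window[:]   # the first 16 bytes of the second block

print(data)
```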

--
nosy: +sbt




[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-24 Thread Richard Oudkerk

Richard Oudkerk added the comment:

This bit looks wrong to me:

    if (offset - size > PY_SSIZE_T_MAX)
        /* Map area too large to fit in memory */
        m_obj->size = (Py_ssize_t) -1;

Should it not be size - offset instead of offset - size?  (offset and size 
are Py_LONG_LONG.)  And there is no check that offset is non-negative.
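
The intended check can be modelled in Python, where integers do not 
overflow.  This is only a sketch of the logic, not the C fix itself; 
checked_map_size is a hypothetical name, the error messages are 
illustrative, and max_size stands in for PY_SSIZE_T_MAX.

```python
import sys

GIB = 1 << 30


def checked_map_size(file_size, offset, max_size=sys.maxsize):
    """Return the number of bytes to map for length == 0, or raise."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if offset > file_size:
        raise ValueError("offset is greater than file size")
    if file_size - offset > max_size:   # size - offset, not offset - size
        raise ValueError("mmap length is too large")
    return file_size - offset


MAX_32 = 2**31 - 1   # PY_SSIZE_T_MAX on a 32 bit build
print(checked_map_size(5 * GIB, 4 * GIB, MAX_32))
try:
    checked_map_size(5 * GIB, 0, MAX_32)
except ValueError as exc:
    print("rejected:", exc)
```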

--




[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-23 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Terry, what makes you think this is a feature request? This is a bug, quite 
simply.

--
nosy: +pitrou
versions: +Python 2.7, Python 3.2, Python 3.3




[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-23 Thread Terry J. Reedy

Terry J. Reedy added the comment:

It is a report of behavior that lacks a specific request for change (that I can 
see). The implied code-change request could break working code. We don't 
usually do that. What do you think should be done?

--




[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-23 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

As I understand it, the issue is that mmap slicing returns an empty string for 
large (but less than the ssize_t limit) indices on 2.7.

Maybe it is related to the 30-bit-digit long integer implementation?

--
nosy: +serhiy.storchaka




[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-23 Thread Terry J. Reedy

Terry J. Reedy added the comment:

To me, Marc's title and penultimate sentence imply that he thinks that mmap 
should not accept such files. (But he should speak for himself.) As I said, not 
accepting such files could break working code.

As for the alternative of 'fixing' methods: Is it only slicing or other 
methods, even *every* method that 'misbehaves' when attempting to read (or 
write) beyond the 1 gig limit? I am guessing the last. If so, just about every 
method (inherited from bytearray, like slicing, or mmap specific) would need a 
fix conditional on the build and access location (and OS or hardware?).

Even for slices, what change would you (or anyone) make? Keep in mind that it 
is a *feature* of slices that they generally always work, and that this is 
specifically true of bytearrays. (Memory-mapped file objects behave like both 
bytearray and like file objects.) 

I am actually a bit surprised that the limit is 1 gb rather than 2, 3, or 4 gb. 
Is it the same on *nix? What is the limit for a bytearray on Win 7?

--




[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-23 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I have only a 32-bit OS and can't answer these questions. I'm surprised by
the 1 GiB limit too.

Marc, can you please check a 4.5 GiB file? What is the limit in this case,
1 GiB or 0.5 GiB? What about slicing a big bytes object or bytearray (if you
have enough memory)?

If mmap on Windows can't work with files larger than 4 GiB, then it should
raise an exception on creation or at least on access. Silently producing a
wrong result is not acceptable.

--




[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-21 Thread Marc Schlaich

New submission from Marc Schlaich:

Platform: Windows 7 64 bit
Interpreter: Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit 
(Intel)] on win32

Here are the steps to reproduce:

1. Create a big file (5 GB):

with open('big', 'wb') as fobj:
    for _ in xrange(1024 * 1024 * 5):
        fobj.write('1' + '0' * 1023)

2. Open and process it with `mmap`:

import mmap
import re
import sys

with open('big', 'rb') as fobj:
    data = mmap.mmap(fobj.fileno(), 0, access=mmap.ACCESS_READ)
    print data.size()
    try:
        counter = 0
        for match in re.finditer('1' + '0' * 1023, data):
            counter += 1
        print len(data[1073740800:1073741824]) # (1 GB - 1024, 1 GB)
        print len(data[1073741824:1073742848]) # (1 GB, 1 GB + 1024)
    finally:
        data.close()

print counter

This returns the following lines:

5368709120
1024
0
1048576

So this is a behavioral issue. `mmap` accepts a file which cannot fit in the 
interpreter memory but fits in the system memory. On processing the data, it 
only reads data until the maximum interpreter memory is reached (1 GB).

--
components: None
messages: 177879
nosy: schlamar
priority: normal
severity: normal
status: open
title: mmap accepts files > 1 GB, but processes only 1 GB
type: behavior
versions: Python 2.7




[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-21 Thread Antoine Pitrou

Changes by Antoine Pitrou pit...@free.fr:


--
components: +Library (Lib), Windows -None
nosy: +brian.curtin, tim.golden
versions: +Python 3.2, Python 3.3, Python 3.4




[issue16743] mmap accepts files > 1 GB, but processes only 1 GB

2012-12-21 Thread Terry J. Reedy

Terry J. Reedy added the comment:

The immediate fix is to use a 64 bit build. That aside, what change in behavior 
are you suggesting? (and for 32 bit builds only?)

Should mmap.mmap warn if the file is longer than can be supported?
This could be added to all current versions.

Should it raise in the same circumstance? What if a person *knows* that the 
file is 'too big' but only wants to access the first gigabyte? Forcing people 
to explicitly pass the magic number 1073741824 would, to me, effectively be a 
3.4-at-best api change.

Perhaps mmap.mmap should be left alone and only the attempt to access beyond 
the cutoff should raise or warn. (Is the 32-bit cutoff OS specific?) Given that 
there are multiple access methods and methods that access, and that all 
accesses are ultimately delegated to the os mmap functions, this could be a 
major nuisance to get right.

Now that disks have grown to larger than a gigabyte, the doc should explicitly 
mention the memory space issue.

--
nosy: +terry.reedy
stage:  -> needs patch
type: behavior -> enhancement
versions:  -Python 2.7, Python 3.2, Python 3.3
