Hi Cameron,

Thanks for experimenting and for the hint about the 8192 bound. I got my question figured out with your and Peter's help (please read my reply to Peter).

Cheers,
Nancy

From: Cameron Simpson <c...@zip.com.au>
To: Nancy Pham-Nguyen <nancyp...@sbcglobal.net>
Cc: "tutor@python.org" <tutor@python.org>
Sent: Tuesday, June 6, 2017 2:12 AM
Subject: Re: [Tutor] f.readlines(size)

On 05Jun2017 21:04, Nancy Pham-Nguyen <nancyp...@sbcglobal.net> wrote:
>I'm trying to understand the optional size argument to the file.readlines method.
>The help(file) shows:
> | readlines(...)
> | readlines([size]) -> list of strings, each a line from the file.
> |
> | Call readline() repeatedly and return a list of the lines so read.
> | The optional size argument, if given, is an approximate bound on the
> | total number of bytes in the lines returned.
>
>From the documentation:
>f.readlines() returns a list containing all the lines of data in the file.
>If given an optional parameter sizehint, it reads that many bytes from the file
>and enough more to complete a line, and returns the lines from that.
>This is often used to allow efficient reading of a large file by lines,
>but without having to load the entire file in memory. Only complete lines
>will be returned.
>
>I wrote the function below to try it, thinking that it would print multiple
>times, 3 lines at a time, but it printed all in one shot, just like when I
>didn't specify the optional argument. Could someone explain what I've missed?
>See input file and output below.

I'm using this to test:

    from __future__ import print_function
    import sys
    # Ask for roughly 1023 bytes' worth of complete lines from standard input.
    lines = sys.stdin.readlines(1023)
    # Number of lines returned, then their total length.
    print(len(lines))
    print(sum(len(_) for _ in lines))
    # The lines themselves.
    print(repr(lines))

I've fed it a 41760 byte input (the size isn't important except that it needs to be "big enough"). The output starts like this:

    270
    8243

and then the line listing.

That 8243 looks interesting, being close to 8192, a power of 2.

The documentation you quote says:

    The optional size argument, if given, is an approximate bound on the
    total number of bytes in the lines returned. [...]
    it reads that many bytes from the file and enough more to complete a
    line, and returns the lines from that.

It looks to me like readlines uses the sizehint somewhat liberally; the purpose as described in the docs is to read input efficiently without using an unbounded amount of memory. Imagine feeding readlines() a terabyte input file without the sizehint: it would try to pull it all into memory. With the sizehint you get a simple form of batching of the input into smallish groups of lines.

I would say, from my experiments here, that the underlying I/O is doing 8192 byte reads from the file as the default buffer size. So although I've asked for 1023 bytes, readlines says something like: I want at least 1023 bytes; the I/O system loads 8192 bytes because that is its normal read size, and readlines then takes everything in the buffer. It does this so as to gather as many lines as are readily available. It then asks for more data to complete the last line.

The last line of my readlines() result is:

    %.class: %.java %.class-prereqs : $(("%.class-prereqs" G?<P)).class

which is 68 bytes long including the newline character. 8192 + 68 = 8260, just over the 8243 bytes of "complete lines" I got back, which fits a last line that starts inside the first 8192 byte read and finishes 51 bytes past it.

So this sizehint is just a clue, and does not change the behaviour of the underlying I/O. It just prevents readlines() from reading the entire file.

If you want tighter control, may I suggest iterating over the file like this:

    for line in sys.stdin:
        ... do stuff with the line ...

This also does not change the underlying I/O buffer size, but it does let you gather only the lines you want: you can count line lengths or line numbers, or use whatever criteria you find useful, if you want to stop before the end of the file.

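To make the two approaches concrete, here is a minimal sketch rather than a definitive recipe. The helper names read_in_batches and take_lines are made up for illustration and are not part of Python, and the in-memory io.StringIO object merely stands in for a real file. How large each readlines() batch actually turns out depends on the Python version and on the file object's buffering, so the code treats the sizehint purely as a hint.

```python
import io


def read_in_batches(f, sizehint):
    """Yield lists of complete lines, each list roughly sizehint in size.

    Hypothetical helper: the actual batch size depends on the Python
    version and the file object's buffering, so sizehint is only a hint.
    """
    while True:
        batch = f.readlines(sizehint)
        if not batch:  # an empty list means end of file
            break
        yield batch


def take_lines(f, max_size):
    """Iterate line by line and stop once max_size has been reached.

    Hypothetical helper: the line that reaches the limit is kept
    rather than dropped.
    """
    taken = []
    total = 0
    for line in f:
        taken.append(line)
        total += len(line)
        if total >= max_size:
            break
    return taken


# Small in-memory stand-in for a real file, for demonstration only.
sample = "".join("line %d\n" % i for i in range(100))

# Print the line count and total length of each batch.
for batch in read_in_batches(io.StringIO(sample), 64):
    print(len(batch), sum(len(line) for line in batch))

# Gather lines one at a time until about 20 characters have been read.
print(take_lines(io.StringIO(sample), 20))
```

The first helper gives the batching described above without ever holding the whole input in memory; the second gives exact control over where to stop, at the cost of writing the loop yourself.
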
Cheers,
Cameron Simpson <c...@zip.com.au>
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor