Re: [Python-ideas] Support parsing stream with `re`

2018-10-12 Thread James Lu
The file system is really just a b-tree. If you’re concerned about using memory, you can implement a O(log n) map using the file system, where the entries are the different critical sections. Every node is a folder and every file is a leaf. Many package managers implement maps like this. I’d like

Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Stephen J. Turnbull
Chris Angelico writes: > On Wed, Oct 10, 2018 at 5:09 AM Stephen J. Turnbull > wrote: > > > > Chris Angelico writes: > > > > > Both processes are using the virtual memory. Either or both could be > > > using physical memory. Assuming they haven't written to the pages > > > (which is

Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Cameron Simpson
On 10Oct2018 00:42, Stephen J. Turnbull wrote: Chris Angelico writes: > On Tue, Oct 9, 2018 at 10:05 PM Greg Ewing wrote: > > Chris Angelico wrote: > > > In contrast, a mmap'd file is memory that you do indeed own. > > > > Although it's not really accurate to say that it's owned by > > a

Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Greg Ewing
Stephen J. Turnbull wrote: Subject to COW, I presume. Probably in units smaller than the whole file (per page?) It can be COW or not, depending on the options passed to mmap. And yes, it's mapped in units of pages. -- Greg ___ Python-ideas mailing

Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Chris Angelico
On Wed, Oct 10, 2018 at 5:09 AM Stephen J. Turnbull wrote: > > Chris Angelico writes: > > > Both processes are using the virtual memory. Either or both could be > > using physical memory. Assuming they haven't written to the pages > > (which is the case with executables - the system mmaps the

Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Stephen J. Turnbull
Chris Angelico writes: > Both processes are using the virtual memory. Either or both could be > using physical memory. Assuming they haven't written to the pages > (which is the case with executables - the system mmaps the binary into > your memory space as read-only), and assuming that those

Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Chris Angelico
On Wed, Oct 10, 2018 at 2:42 AM Stephen J. Turnbull wrote: > > Chris Angelico writes: > > On Tue, Oct 9, 2018 at 10:05 PM Greg Ewing > wrote: > > > > > > Chris Angelico wrote: > > > > In contrast, a mmap'd file is memory that you do indeed own. > > > > > > Although it's not really

Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Stephen J. Turnbull
Chris Angelico writes: > On Tue, Oct 9, 2018 at 10:05 PM Greg Ewing > wrote: > > > > Chris Angelico wrote: > > > In contrast, a mmap'd file is memory that you do indeed own. > > > > Although it's not really accurate to say that it's owned by > > a particular process. If two processes

Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Chris Angelico
On Tue, Oct 9, 2018 at 10:05 PM Greg Ewing wrote: > > Chris Angelico wrote: > > In contrast, a mmap'd file is memory that you do indeed own. > > Although it's not really accurate to say that it's owned by > a particular process. If two processes mmap the same file, > the physical memory pages

Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Greg Ewing
Chris Angelico wrote: In contrast, a mmap'd file is memory that you do indeed own. Although it's not really accurate to say that it's owned by a particular process. If two processes mmap the same file, the physical memory pages holding it appear in the address spaces of both processes. --

Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Chris Angelico
On Mon, Oct 8, 2018 at 11:15 PM Anders Hovmöller wrote: > > > However, another possibility is the regexp is consuming lots of memory. > > The regexp seems simple enough (b'.'), so I doubt it is leaking memory like > mad; I'm guessing you're just seeing the OS page in as much of the file as it

Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Ram Rachum
Thanks for your help everybody! I'm very happy to have learned about mmap. On Mon, Oct 8, 2018 at 3:27 PM Richard Damon wrote: > On 10/8/18 8:11 AM, Ram Rachum wrote: > > " Windows will aggressively fill up your RAM in cases like this > > because after all why not? There's no use to having

Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Richard Damon
On 10/8/18 8:11 AM, Ram Rachum wrote: > " Windows will aggressively fill up your RAM in cases like this > because after all why not?  There's no use to having memory just > sitting around unused." > > Two questions: > > 1. Is the "why not" sarcastic, as in you're agreeing it's a waste? > 2. Will

Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Anders Hovmöller
>> However, another possibility is the regexp is consuming lots of memory. >> >> The regexp seems simple enough (b'.'), so I doubt it is leaking memory like >> mad; I'm guessing you're just seeing the OS page in as much of the file as it >> can. > > Yup. Windows will aggressively fill up

Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Ram Rachum
" Windows will aggressively fill up your RAM in cases like this because after all why not? There's no use to having memory just sitting around unused." Two questions: 1. Is the "why not" sarcastic, as in you're agreeing it's a waste? 2. Will this be different on Linux? Which command do I run on

Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Erik Bray
On Mon, Oct 8, 2018 at 12:20 PM Cameron Simpson wrote: > > On 08Oct2018 10:56, Ram Rachum wrote: > >That's incredibly interesting. I've never used mmap before. > >However, there's a problem. > >I did a few experiments with mmap now, this is the latest: > > > >path = pathlib.Path(r'P:\huge_file')

Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Ram Rachum
I'm not an expert on memory. I used Process Explorer to look at the Process. The Working Set of the current run is 11GB. The Private Bytes is 708MB. Actually, see all the info here: https://www.dropbox.com/s/tzoud028pzdkfi7/screenshot_TURING_2018-10-08_133355.jpg?dl=0 I've got 16GB of RAM on this

Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Cameron Simpson
On 08Oct2018 10:56, Ram Rachum wrote: That's incredibly interesting. I've never used mmap before. However, there's a problem. I did a few experiments with mmap now, this is the latest: path = pathlib.Path(r'P:\huge_file') with path.open('r') as file: mmap = mmap.mmap(file.fileno(), 0,

Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Ram Rachum
That's incredibly interesting. I've never used mmap before. However, there's a problem. I did a few experiments with mmap now, this is the latest: path = pathlib.Path(r'P:\huge_file') with path.open('r') as file: mmap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) for match in

Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Nathaniel Smith
On Sun, Oct 7, 2018 at 5:54 PM, Nathaniel Smith wrote: > Are you imagining something roughly like this? (Ignoring chunk > boundary handling for the moment.) > > def find_double_line_end(buf): > start = 0 > while True: > next_idx = buf.index(b"\n", start) > if buf[next_idx
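
The snippet above is cut off, so the following is not a reconstruction of it. It is only a minimal sketch in the same spirit: scan for newlines with the C-speed `find` method and check the byte that follows. The name `find_double_newline` and its return convention are assumptions made for illustration, and chunk boundary handling is ignored here as well.

```python
def find_double_newline(buf, start=0):
    """Return the index just past the first pair of consecutive newline
    bytes at or after `start`, or -1 if there is no such pair.

    `buf` is assumed to be a bytes-like object with a find() method whose
    slices compare equal to bytes (bytes, bytearray, or mmap.mmap).
    """
    while True:
        idx = buf.find(b"\n", start)  # the scan itself runs in C
        if idx == -1:
            return -1
        # Slicing (rather than indexing) is safe at the very end of the buffer.
        if buf[idx + 1:idx + 2] == b"\n":
            return idx + 2
        start = idx + 1
```

For this fixed two-byte pattern a single `buf.find(b"\n\n")` would also do; the loop shape only matters when the check after each newline is more involved.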

Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Nathaniel Smith
On Sun, Oct 7, 2018 at 5:09 PM, Terry Reedy wrote: > On 10/6/2018 5:00 PM, Nathaniel Smith wrote: >> >> On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum wrote: >>> >>> I'd like to use the re module to parse a long text file, 1GB in size. I >>> wish >>> that the re module could parse a stream, so I

Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Terry Reedy
On 10/7/2018 12:32 AM, Ram Rachum wrote: Does that mean I'll have to write that character-by-character algorithm? I would not be surprised if you could make use of str.index, which scans at C speed. See my answer to Nathaniel. -- Terry Jan Reedy

Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Terry Reedy
On 10/6/2018 5:00 PM, Nathaniel Smith wrote: On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum wrote: I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over

Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Greg Ewing
Jonathan Fine wrote: Provided mmap releases memory when possible, It will. The virtual memory system will read pages from the file into RAM when needed, and re-use those RAM pages for other purposes when needed. It should be pretty much the most efficient solution possible. -- Greg

Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Jonathan Fine
Anders wrote > An mmap object is one of the things you can make a memoryview of, > although looking again, it seems you don't even need to, you can > just re.search the mmap object directly. > > re.search'ing the mmap object means the operating system takes care of > the streaming for you,

Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread 2015
On 18-10-07 16.15, Ram Rachum wrote: > I tested it now and indeed bytes patterns work on memoryview objects. > But how do I use this to scan for patterns through a stream without > loading it to memory? An mmap object is one of the things you can make a memoryview of, although looking again, it
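
A minimal sketch of the suggestion above, assuming a bytes pattern and a non-empty file. The helper name `iter_matches` is hypothetical and does not come from the thread; `mmap.mmap`, `mmap.ACCESS_READ` and `re.finditer` are the standard-library calls.

```python
import mmap
import re


def iter_matches(path, pattern):
    """Yield (start, end, matched_bytes) for each match of a bytes
    pattern in the file at `path`, searching a read-only memory map."""
    compiled = re.compile(pattern)
    with open(path, "rb") as f:
        # Length 0 maps the whole file; an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in compiled.finditer(mm):
                # m.group() copies only the matched bytes out of the map.
                yield m.start(), m.end(), m.group()
```

Something like `for start, end, data in iter_matches("print.gcode", rb"\n\n+"): ...` (file name and pattern are placeholders) would then walk the matches one at a time while the operating system pages the file in and out as needed.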

Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Ram Rachum
I tested it now and indeed bytes patterns work on memoryview objects. But how do I use this to scan for patterns through a stream without loading it to memory? On Sun, Oct 7, 2018 at 4:24 PM <2...@jmunch.dk> wrote: > On 18-10-07 15.11, Ram Rachum wrote: > > > Unfortunately, it's not helpful. I

Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread 2015
On 18-10-07 15.11, Ram Rachum wrote: > Unfortunately, it's not helpful. I was developing a solution similar to yours before I came to the conclusion that a multiline regex would be more elegant. How about memory mapping your 1GB file? bytes patterns work on memoryviews. regards, Anders

Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Ram Rachum
Hi Cameron, Thanks for putting in the time to study my problem and sketch a solution. Unfortunately, it's not helpful. I was developing a solution similar to yours before I came to the conclusion that a multiline regex would be more elegant. I find this algorithm to be quite complicated. It's

Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Nathaniel Smith
On Sat, Oct 6, 2018, 18:40 Steven D'Aprano wrote: > The message I take from this is: > > - regex engines certainly can be written to support streaming data; > - but few of them are; > - and it is exceedingly unlikely to be able to easily (or at all) > retro-fit that support to Python's

Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Cameron Simpson
On 07Oct2018 07:32, Ram Rachum wrote: On Sun, Oct 7, 2018 at 4:40 AM Steven D'Aprano wrote: I'm sure that Python will never be as efficient as C in that regard (although PyPy might argue the point) but is there something we can do to ameliorate this? If we could make char-by-char processing

Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Cameron Simpson
On 07Oct2018 07:30, Ram Rachum wrote: I'm doing multi-color 3d-printing. The slicing software generates a GCode file, which is a text file of instructions for the printer, each command meaning something like "move the head to coordinates x,y,z while extruding plastic at a rate of w" and lots of

Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Ram Rachum
On Sun, Oct 7, 2018 at 4:40 AM Steven D'Aprano wrote: > I'm sure that Python will never be as efficient as C in that regard > (although PyPy might argue the point) but is there something we can do > to ameliorate this? If we could make char-by-char processing only 10 > times less efficient than

Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Ram Rachum
Hi Ned! I'm happy to see you here. I'm doing multi-color 3d-printing. The slicing software generates a GCode file, which is a text file of instructions for the printer, each command meaning something like "move the head to coordinates x,y,z while extruding plastic at a rate of w" and lots of

Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Steven D'Aprano
On Sat, Oct 06, 2018 at 02:00:27PM -0700, Nathaniel Smith wrote: > Fortunately, there's an elegant and natural solution: Just save the > regex engine's internal state when it hits the end of the string, and > then when more data arrives, use the saved state to pick up the search > where we left

Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Nathaniel Smith
On Sat, Oct 6, 2018 at 2:04 PM, Chris Angelico wrote: > On Sun, Oct 7, 2018 at 8:01 AM Nathaniel Smith wrote: >> >> On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum wrote: >> > I'd like to use the re module to parse a long text file, 1GB in size. I >> > wish >> > that the re module could parse a

Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Chris Angelico
On Sun, Oct 7, 2018 at 9:54 AM Nathaniel Smith wrote: > > On Sat, Oct 6, 2018 at 2:04 PM, Chris Angelico wrote: > > On Sun, Oct 7, 2018 at 8:01 AM Nathaniel Smith wrote: > >> > >> On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum wrote: > >> > I'd like to use the re module to parse a long text file,

Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Chris Angelico
On Sun, Oct 7, 2018 at 8:01 AM Nathaniel Smith wrote: > > On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum wrote: > > I'd like to use the re module to parse a long text file, 1GB in size. I wish > > that the re module could parse a stream, so I wouldn't have to load the > > whole thing into memory.

Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Nathaniel Smith
On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum wrote: > I'd like to use the re module to parse a long text file, 1GB in size. I wish > that the re module could parse a stream, so I wouldn't have to load the > whole thing into memory. I'd like to iterate over matches from the stream > without keeping

Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Ned Batchelder
On 10/6/18 7:25 AM, Ram Rachum wrote: "This is a regular expression problem, rather than a Python problem." Do you have evidence for this assertion, except that other regex implementations have this limitation? Is there a regex specification somewhere that specifies that streams aren't

Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Jonathan Fine
I wrote: > This is a regular expression problem, rather than a Python problem. Ram wrote: > Do you have evidence for this assertion, except that > other regex implementations have this limitation? Yes. 1. I've already supplied: https://svn.boost.org/trac10/ticket/11776 2.

Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Ram Rachum
"This is a regular expression problem, rather than a Python problem." Do you have evidence for this assertion, except that other regex implementations have this limitation? Is there a regex specification somewhere that specifies that streams aren't supported? Is there a fundamental reason that

Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Jonathan Fine
Hi Ram You wrote: > I'd like to use the re module to parse a long text file, 1GB in size. I > wish that the re module could parse a stream, so I wouldn't have to load > the whole thing into memory. I'd like to iterate over matches from the > stream without keeping the old matches and input in

Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Ram Rachum
It'll load as much as it needs to in order to match or rule out a match on a pattern. If you'd try to match `a.*b` it'll load the whole thing. The use cases that are relevant to a stream wouldn't have these kinds of problems. On Sat, Oct 6, 2018 at 11:22 AM Serhiy Storchaka wrote: > 06.10.18

Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Serhiy Storchaka
06.10.18 10:22, Ram Rachum пише: I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and

[Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Ram Rachum
Hi, I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM. What do you think?