Re: efficiently building a large list
What happens if you first load the whole thing into memory, i.e. (let Invoices (in ... and then work with the Invoices variable?

On Thu, May 31, 2012 at 9:56 PM, Joe Bogner joebog...@gmail.com wrote:

> I'd like to do some analysis on a large text file of invoices - for
> example, grouping and summing. Is this an efficient way to build the
> list? The file has 4 million rows in it. The first several hundred
> thousand load quickly, and then I notice the time between my
> checkpoints growing (10 secs+ per 10,000). I ran it for about 10
> minutes, killed it, and was at around 2.4M rows.
>
> I played around with other variations of building a list (push1, idx,
> etc.) and, as expected, they all showed more significant time growth
> as the list grew.
>
>    (make
>       (in "invoices.txt"
>          (until (eof)
>             (setq Line (line))
>             (at (0 . 1) (prinl Line))
>             (link (mapcar pack (split Line "^I"))) ) ) )
>
> My file has: Customer #, Prod #, Amount, Quantity
>
> I'm experimenting with loading into a db right now to see if that
> yields better results.
>
> Thanks
> Joe

-- UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe
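[Editor's sketch] The truncated suggestion above might be fleshed out along these lines. This is not from the original mail; it reuses the file name and tab-splitting from Joe's code and is untested:

```picolisp
# Read the whole file once into an in-memory list bound to Invoices,
# then do all further analysis on that variable instead of re-reading
# the file each time.
(let Invoices
   (in "invoices.txt"
      (make
         (until (eof)
            (link (mapcar pack (split (line) "^I"))) ) ) )
   (length Invoices) )   # e.g. count the rows
```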
Re: efficiently building a large list
Hi Joe,

> I'd like to do some analysis on a large text file of invoices - for
> example, grouping and summing. Is this an efficient way to build the
> list? The file has 4 million rows in it. The first several hundred
> thousand load quickly and then I notice the time between my checkpoints
> taking longer (10 secs+ per 10,000). I ran it for about 10 minutes and
> killed it and was around 2.4M rows

I think this is simply a memory problem. Each line results in a list of 4 'pack'ed symbols, i.e. 5 cells plus the symbols, each of which takes at least 1 cell. So 2.4M rows should be at least 2.3 GB on a 64-bit machine.

The interpreter keeps allocating more and more memory, an additional MB on each garbage collection. You can speed that up if you run (gc 2300) in the beginning. This will pre-allocate 2.3 GB, if you have enough RAM. If the available RAM is smaller, the process will start to thrash pages and slow down a lot.

In general, I think it is not wise to read so much information into a flat list.

> I played around with other variations of building a list (push1, idx,
> etc.) and as expected they all had a more significant time growth as
> the list grew.

Yes, your way of using 'make' and 'link' is the best.

> I'm experimenting with loading into a db right now to see if that
> yields better results.

Yes, that's better.

Cheers,
- Alex
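[Editor's sketch] The pre-allocation advice can be tried directly in the REPL. The argument to 'gc' is in megabytes; 2300 is Alex's (over)estimate for this data set and has to fit in available RAM:

```picolisp
# Pre-allocate the heap in one step instead of letting the garbage
# collector grow it 1 MB at a time during the load.
(gc 2300)   # reserve ~2.3 GB up front

# ... load the data here ...

# A plain (gc) later triggers a normal collection.
(gc)
```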
Re: efficiently building a large list
On Thu, May 31, 2012 at 05:35:09PM +0200, Alexander Burger wrote:

> Each line results in a list of 4 'pack'ed symbols, i.e. 5 cells plus
> the symbols, each of which takes at least 1 cell. So 2.4M rows should
> be at least 2.3 GB on a 64-bit machine.

Oops! Shame! 2.4e6 times 6 cells on a 64-bit machine are 230400000 bytes. That's only 230 MByte, not 2.3 GByte!

Anyway, you should first try:

> The interpreter keeps allocating more and more memory, an additional
> MB on each garbage collection. You can speed that up if you run
> (gc 2300)

using, e.g., (gc 250) instead, to pre-allocate memory. This avoids the garbage collector running again and again.

Cheers,
- Alex
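[Editor's note] The corrected figure is easy to re-check in the REPL, given that a PicoLisp cell is 16 bytes on a 64-bit system:

```picolisp
# rows * cells-per-row * bytes-per-cell
: (* 2400000 6 16)
-> 230400000   # i.e. about 230 MB, as the correction says
```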
Re: efficiently building a large list
> using, e.g., (gc 250) instead, to pre-allocate memory. This avoids the
> garbage collector running again and again.

BTW, 'proc' is very useful here to check the process size:

   $ pil +
   : (proc 'pil)
     PID  PPID STARTED     SIZE %CPU WCHAN CMD
   25575  1831 18:14:55    1512  0.3 -     /usr/bin/picolisp /usr/lib/picolisp/lib.l -
   -> T
   : (gc 250)
   -> 250
   : (proc 'pil)
     PID  PPID STARTED     SIZE %CPU WCHAN CMD
   25575  1831 18:14:55  258512  2.8 -     /usr/bin/picolisp /usr/lib/picolisp/lib.l -
   -> T
   :

Cheers,
- Alex
Re: efficiently building a large list
Hi Joe,

> Sidebar: Is there a way to disable the interactive session from
> printing the return value of a statement? For example, if I do a
> (setq ABC L) where L is a million items, I'd prefer the option of not
> having all million items print on my console. I've worked around this
> by wrapping it in a prog and returning NIL. Is there an easier way?

You could also use http://software-lab.de/doc/refN.html#nil or http://software-lab.de/doc/refT.html#t

Cheers,
Tomas
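[Editor's sketch] The two documented helpers Tomas points to can wrap any expression; a quick illustration:

```picolisp
# 'nil' evaluates its arguments and returns NIL, so the REPL prints
# only NIL instead of the million-element list:
(nil (setq ABC (range 1 1000000)))   # -> NIL

# 't' does the same but returns T:
(t (setq ABC (range 1 1000000)))     # -> T
```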
Re: efficiently building a large list
Thanks Tomas, I've started using 'nil' now.

This is what I came up with to aggregate the data. It actually runs reasonably well. I'm sharing because I always enjoy reading other people's PicoLisp code, so I figure others may as well. My source file has 4 million rows:

   : (bench (pivot L 'CustNum))
   35.226 sec
   # outputs 31,000 rows

My approach is to load it in as follows:

   (class +Invoice)
   (rel CustNum (+String))
   (rel ProdNum (+String))
   (rel Amount (+Number))
   (rel Quantity (+Number))

   (de Load ()
      (zero N)
      (setq L
         (make
            (in "invoices.txt"
               (until (eof)
                  (setq Line (line))
                  (setq D (mapcar pack (split Line "^I")))
                  (link
                     (new '(+Invoice)
                        'CustNum (car (nth D 1))
                        'ProdNum (car (nth D 2))
                        'Amount (format (car (nth D 3)))
                        'Quantity (format (car (nth D 4))) ) ) ) ) ) )
      T )

I can probably clean this up. I tinkered around with various approaches, and this was the best I could come up with in a few hours.

At first I was using something like the 'group' from lib.l, but found it to be too slow. I think that's because I optimize for a sorted list instead of scanning for a match in the made list:

   (de sortedGroup (List Fld)
      (make
         (let (Last NIL  LastSym NIL)
            (for This List
               (let Key (get This Fld)
                  (if (<> Last Key)
                     (prog
                        (if LastSym (link LastSym))
                        (off LastSym)
                        (push 'LastSym Key) ) )
                  (push 'LastSym This)
                  (setq Last Key) ) )
            (link LastSym) ) ) )

And here's the piece that ties it all together:

   (de pivot (L Fld)
      (let (SL (by '((X) (get X Fld)) sort L)  SG (sortedGroup SL Fld))
         (out "pivot.txt"
            (for X SG
               (let (Amt 0)
                  (mapc
                     '((This) (inc 'Amt (: Amount)))
                     (cdr (reverse X)) )
                  (setq Key (get (car X) Fld))
                  (prinl Key "^I" Amt) ) ) ) ) )

   (Load)
   : (bench (pivot L 'CustNum))
   35.226 sec
   : (bench (pivot L 'ProdNum))
   40.945 sec

It seems the best performance came from sorting, then splitting, and then summing the individual parts. It also makes for a nice report.
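[Editor's sketch] The (by '((X) (get X Fld)) sort L) idiom used in pivot can be seen in isolation on toy data (the pairs below are invented for illustration):

```picolisp
# 'by' applies the first function to extract a sort key, then sorts
# the list with the second function. Here: sort pairs by their CAR.
: (by car sort '((3 . c) (1 . a) (2 . b)))
-> ((1 . a) (2 . b) (3 . c))
```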
Sidenote: At first I thought I was getting better performance by using a modified version of quicksort off Rosetta Code, but then I switched to the built-in 'sort' and saw considerably better speed.

Thanks for the help everyone.

On Thu, May 31, 2012 at 3:37 PM, Tomas Hlavaty t...@logand.com wrote:

> Hi Joe,
>
>> Sidebar: Is there a way to disable the interactive session from
>> printing the return value of a statement? For example, if I do a
>> (setq ABC L) where L is a million items, I'd prefer the option of not
>> having all million items print on my console. I've worked around this
>> by wrapping it in a prog and returning NIL. Is there an easier way?
>
> You could also use http://software-lab.de/doc/refN.html#nil or
> http://software-lab.de/doc/refT.html#t
>
> Cheers,
> Tomas