Re: efficiently building a large list

2012-05-31 Thread Henrik Sarvell
What happens if you first load the whole thing into memory, ie (let
Invoices (in ... and then work with the Invoices variable?



On Thu, May 31, 2012 at 9:56 PM, Joe Bogner joebog...@gmail.com wrote:
 I'd like to do some analysis on a large text file of invoices - for example,
 grouping and summing. Is this an efficient way to build the list?

 The file has 4 million rows in it. The first several hundred thousand load
 quickly and then I notice the time between my checkpoints taking longer (10
 secs+ per 10,000). I ran it for about 10 minutes and killed it and was
 around 2.4M rows

 I played around with other variations of building a list push1, idx, etc and
 as expected they all had a more significant time growth as the list grew.

   (make
     (in invoices.txt
       (until (eof)
         (setq Line (line) )
         (at (0 . 1) (prinl Line))
         (link (mapcar pack (split Line ^I))) ) ) )


 My file has:
 Customer #
 Prod #
 Amount
 Quantity

 I'm experimenting with loading into a db right now to see if that yields
 better results.

 Thanks
 Joe
--
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: efficiently building a large list

2012-05-31 Thread Alexander Burger
Hi Joe,

 I'd like to do some analysis on a large text file of invoices - for
 example, grouping and summing. Is this an efficient way to build the list?
 
 The file has 4 million rows in it. The first several hundred thousand load
 quickly and then I notice the time between my checkpoints taking longer (10
 secs+ per 10,000). I ran it for about 10 minutes and killed it and was
 around 2.4M rows

I think this is simply a memory problem.

Each line results in a list of 4 'pack'ed symbols, i.e. 5 cells plus the
symbols with each at least 1 cell. So 2.4M rows should be at least 2.3
GB on a 64-bit machine.

The interpreter keeps allocating more and more memory, an additional M
on each garbage collection. You can speed that up if you run

   (gc 2300)

in the beginning. This will pre-allocate 2.3 G if you have enough RAM.
If the available RAM is smaller, the process will start to trash pages
and slow down a lot.


In general, I think it is not wise to read so much information into a
flat list.



 I played around with other variations of building a list push1, idx, etc
 and as expected they all had a more significant time growth as the list
 grew.

Yes, your way of using 'make' and 'link' is the best.


 I'm experimenting with loading into a db right now to see if that yields
 better results.

Yes, that's better.

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: efficiently building a large list

2012-05-31 Thread Alexander Burger
On Thu, May 31, 2012 at 05:35:09PM +0200, Alexander Burger wrote:
 Each line results in a list of 4 'pack'ed symbols, i.e. 5 cells plus the
 symbols with each at least 1 cell. So 2.4M rows should be at least 2.3
 GB on a 64-bit machine.

Oops! Shame!

2.4e6 times 6 cells on a 64-bit machine are 23040 bytes. That's only
230 MByte, not 2.4 GByte!


Anyway, you should first try:

 The interpreter keeps allocating more and more memory, an additional M
 on each garbage collection. You can speed that up if you run
 
(gc 2300)

using, e.g. (gc 250) instead, to pre-allocate memory. This avoids that
the garbage collector runs again and again.

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: efficiently building a large list

2012-05-31 Thread Alexander Burger
 using, e.g. (gc 250) instead, to pre-allocate memory. This avoids that
 the garbage collector runs again and again.

BTW, 'proc' is very useful here to check the process size:

$ pil +
: (proc 'pil)
  PID  PPID  STARTED  SIZE %CPU WCHAN  CMD
25575  1831 18:14:55  1512  0.3 -  /usr/bin/picolisp /usr/lib/picolisp/lib.l
- T

: (gc 250)
- 250

: (proc 'pil)
  PID  PPID  STARTED  SIZE %CPU WCHAN  CMD
25575  1831 18:14:55 258512 2.8 -  /usr/bin/picolisp /usr/lib/picolisp/lib.l
- T
:

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: efficiently building a large list

2012-05-31 Thread Tomas Hlavaty
Hi Joe,

 Sidebar: Is there a way to disable the interactive session from
 printing the return of a statement? For example, if I do a (setq ABC
 L) where L is a million items, I'd prefer the option of not having all
 million items print on my console. I've worked around this by wrapping
 it in a prog and returning NIL. Is there an easier way?

you could also use http://software-lab.de/doc/refN.html#nil or
http://software-lab.de/doc/refT.html#t

Cheers,

Tomas
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: efficiently building a large list

2012-05-31 Thread Joe Bogner
Thanks Tomas, I've started using nil now.

 This is what I came up with to aggregate the data. It actually runs
reasonably well. I'm sharing because I always enjoy reading other people's
picoLisp code so I figure others may as well.

My source file has 4 million rows

: (bench (pivot L 'CustNum))
35.226 sec

# outputs 31,000 rows.

My approach is to load it in as follows:

(class +Invoice)
(rel CustNum (+String))
(rel ProdNum (+String))
(rel Amount (+Number))
(rel Quantity (+Number))

(de Load ()
  (zero N)
  (setq L (make (
  (in invoices.txt
(until (eof)
  (setq Line (line) )
  (setq D (mapcar pack (split Line ^I)))
  (link (new
'(+Invoice)
'CustNum (car (nth D 1))
'ProdNum (car (nth D 2))
'Amount (format (car (nth D 3)))
'Quantity (format (car (nth D 4))) )) ) ) ) ) ) T )


I can probably clean this up.  I tinkered around with various approaches
and this was the best I could come up with in a few hours. At first I was
using something like the group from lib.l but found it to be too slow. I
think it was due to the fact that I optimize for a sorted list instead of
scanning for a match in the made list

(de sortedGroup (List Fld)
  (make
(let (Last NIL LastSym NIL)
 (for This List
  (let Key (get This Fld)
(if ( Last Key)
(prog
(if LastSym (link LastSym))
(off LastSym)
(push 'LastSym Key)) )
 (push 'LastSym This)
 (setq Last Key) ) )
 (link LastSym)) ) )

And here's the piece that ties it all together:

(de pivot (L Fld)
  (let (SL (by '((X) (get X Fld)) sort L) SG (sortedGroup SL Fld))
(out pivot.txt
  (for X SG
(let (Amt 0)
  (mapc '((This) (inc 'Amt (: Amount))) (cdr (reverse X)))
  (setq Key (get (car X) Fld))
  (prinl Key ^I Amt) ) ) ) ) )


(Load)

: (bench (pivot L 'CustNum))
35.226 sec

: (bench (pivot L 'ProdNum))
40.945 sec

It seems the best performance was by sorting, then splitting and then
summing the individual parts. It also makes for a nice report.

Sidenote: At first I thought I was getting better performance by using a
modified version of quicksort off rosetta code, but then I switched it to
the built-in sort and saw considerably better speed.

Thanks for the help everyone

On Thu, May 31, 2012 at 3:37 PM, Tomas Hlavaty t...@logand.com wrote:

 Hi Joe,

  Sidebar: Is there a way to disable the interactive session from
  printing the return of a statement? For example, if I do a (setq ABC
  L) where L is a million items, I'd prefer the option of not having all
  million items print on my console. I've worked around this by wrapping
  it in a prog and returning NIL. Is there an easier way?

 you could also use http://software-lab.de/doc/refN.html#nil or
 http://software-lab.de/doc/refT.html#t

 Cheers,

 Tomas
 --
 UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe