Re: db new! performance

2012-05-30 Thread Alexander Burger
On Wed, May 30, 2012 at 12:28:50PM +0700, Henrik Sarvell wrote:
 Use new and chunk it up:
 
(dbSync)
(for A As
   (at (0 . 1000) (commit 'upd) (prune) (dbSync))
   (new (db: +Article) '(+Article) 'key1 value1 'key2 value2 ... ))
(commit 'upd)
 
 With new! you are locking and writing every row so should only be used
 in cases where you know you are only inserting one (or maybe very
 few).
 
 Above we create them in memory and write 1000 of them at a time.
 
 If you have 12 million you should probably use an even higher number than 
 1000.

Yes, I usually use 1000. Larger values seem not to bring any further
improvement, and use too much memory.


You can slightly simplify and speed up the above if you do not need to
synchronize with other processes during that import (i.e. if other
processes can wait until the import is done). Then you can omit the
'upd' argument to 'commit' (which signals the changes to other
processes) and the (dbSync) calls in the loop.

And, in the end, you could call (prune T) to reset to normal behavior,
though not doing so has no ill effect.

With that, we would have

   (dbSync)
   (for ...
      (new (db: +Article) '(+Article) 'key1 value1 'key2 value2 ... )
      (at (0 . 1000) (commit) (prune)) )
   (commit 'upd)
   (prune T)


Practically, I would also omit the 'prune' calls, and only insert them
if I find that the import process uses too much memory (monitor with
'top' or 'ps'). This speeds up small imports (and 12 million objects
still counts as small here).

This would simplify the import to

   (dbSync)
   (for ...
      (new (db: +Article) '(+Article) 'key1 value1 'key2 value2 ... )
      (at (0 . 1000) (commit)) )
   (commit 'upd)

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: db new! performance

2012-05-30 Thread Joe Bogner
Hi Alex,

Thanks for the reply. Just for reference, using seq is actually
considerably slower. It ran in 39 seconds vs. 4 seconds. I think it's
because it has to look up every object from disk to get the value of 'id
instead of using the index which is likely in memory. The index appears to
be stored as a simple list of external symbols and the index values. I'm
just guessing, though.

Thanks,
Joe

On Wed, May 30, 2012 at 9:36 AM, Alexander Burger a...@software-lab.de wrote:

 Hi Joe,

  Thank you. That sped it up. It's taking 69 seconds to insert 1M records
 
  (pool "foo.db")
  (class +Invoice +Entity)
  (rel id (+Key +Number))
  (zero N)
  (bench (do 1000000 (new (db: +Invoice) '(+Invoice) 'id (inc 'N)) ))
  (commit)

 You can further speed it up if you distribute objects and indices across
 separate files. For the above example:

   (class +Invoice +Entity)
   (rel id (+Key +Number))

   (dbs
      (3)                  # First file, 512 byte blocks
      (2 +Invoice)         # Second file, 256 byte blocks
      (4 (+Invoice id)) )  # Third file, 1024 byte blocks

 This puts the '+Invoice' objects into the second file (with a block size
 of 256), and the 'id' index into the third (with a block size of 1024).

 The first file (with a block size of 512) is not specified to hold any
 entities here, so it contains only the administrative data (root and
 base objects).


 Then you must pass a directory (instead of a file name) and the database
 size specifications to 'pool':

   (pool "foo.db/" *Dbs)

 If you have really large indexes (more than, say, 10 or 100 million
 entries), then you might experiment with an even larger block size
 (e.g. 6, giving 4096 byte blocks). In my experience, performance goes
 down again if the block size is too large.



  I can work with that. Now I am testing out queries.
 
  ? (bench (iter (tree 'id '+Invoice) '((This) (inc 'Z (: id)))))
  11.822 sec
 
  ? (bench (scan (tree 'id '+Invoice) '((Val Key) (inc 'Z Val))))
  4.430 sec
 
  It makes sense that scan would be fastest because I can use the index
  directly. Is that likely the fastest query to sum up a number relation?

 Yes, it is surely faster than a Pilog query (though less powerful).

 The absolutely fastest, though, would be to use 'seq', i.e. avoid
 completely to use an index. This can be used occasionally, when
 (as in the above case) a file consists mainly of objects of a
 single type:

   (bench
      (for (This (seq (db: +Invoice)) This (seq This))
         (inc 'Z (: id)) ) )

 If the file also might contain other objects, use this as the last line:

 (and (isa '+Invoice This) (inc 'Z (: id)) )
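 
 Putting both fragments together, the type-checked loop would read like
 this (an untested sketch, assembled from the pieces above):
 
   (bench
      (for (This (seq (db: +Invoice)) This (seq This))
         (and (isa '+Invoice This) (inc 'Z (: id))) ) )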

 Cheers,
 - Alex



Re: db new! performance

2012-05-30 Thread Alexander Burger
Hi Joe,

 Thanks for the reply. Just for reference, using seq is actually
 considerably slower. It ran in 39 seconds vs. 4 seconds.

Yeah, tried it here too. It is only 9 seconds vs. 6 seconds, though.


 I think it's
 because it has to look up every object from disk to get the value of 'id
 instead of using the index which is likely in memory. The index appears to
 be stored as a simple list of external symbols and the index value.  I'm
 just guessing through.

Not really. While it is correct that the index holds values and symbols,
the value is not used here:

   (iter (tree 'id '+Invoice) '((This) (inc 'N (: id))))

That is, 'iter' iterates all nodes of the index tree, but then loads the
object into 'This' and (: id) fetches the value from there (your
previous example of 'scan' used only the index).
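
For contrast, the 'scan' version sums the key values directly from the
index nodes, without loading any object (the same call as in your test,
repeated here with 'N' for consistency):

   (zero N)
   (scan (tree 'id '+Invoice)
      '((Val Key) (inc 'N Val)) )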


I think I know why

   (for (This (seq (db: +Invoice)) This (seq This)) (inc 'N (: id)))

is a bit slower. In principle, it has less to do: it just steps through
the objects with 'seq', which are also nicely in order in the file. But
internally it behaves differently. 'seq' is not designed for that
purpose, and doesn't reuse the existing symbols in the Lisp heap.


In general, I would not rely too tightly on isolated tests like the one
above, as the results are mostly determined by the system's disk cache
behavior. Instead, always try to optimize against the production
database model.

Cheers,
- Alex