Re: db new! performance
Hi Joe,

> Thanks for the reply. Just for reference, using seq is actually
> considerably slower. It ran in 39 seconds vs. 4 seconds.

Yeah, tried it here too. It is only 9 seconds vs. 6 seconds, though.

> I think it's because it has to look up every object from disk to get
> the value of 'id instead of using the index, which is likely in memory.
> The index appears to be stored as a simple list of external symbols
> and the index value. I'm just guessing though.

Not really. While it is correct that the index holds values and symbols,
the value is not used here:

   (iter (tree 'id '+Invoice) '((This) (inc 'N (: id))))

That is, 'iter' iterates all nodes of the index tree, but then loads the
object into 'This', and (: id) fetches the value from there (your
previous example with 'scan' used only the index).

I think I know why

   (for (This (seq (db: +Invoice)) This (seq This))
      (inc 'N (: id)) )

is a bit slower. In principle, it has less to do, just stepping through
the objects with 'seq', which are also nicely in order in the file. But
internally it behaves differently: 'seq' is not designed for that
purpose, and doesn't reuse the existing symbols in the Lisp heap.

In general, I would not rely too tightly on isolated tests like the one
above, as the results are mostly determined by the system's disk cache
behavior. Always try to optimize against the production database model.

Cheers,
- Alex
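For reference, the three summation variants discussed in this thread,
side by side. A minimal sketch assuming the '+Invoice' schema from the
messages below; as noted above, the absolute timings are dominated by
the system's disk cache:

   (zero N)
   (bench  # 'scan': reads the keys directly from the index, no object loads
      (scan (tree 'id '+Invoice) '((Val Key) (inc 'N Val))) )

   (zero N)
   (bench  # 'iter': walks the index, but loads each object to read (: id)
      (iter (tree 'id '+Invoice) '((This) (inc 'N (: id)))) )

   (zero N)
   (bench  # 'seq': steps through the object file, ignoring the index
      (for (This (seq (db: +Invoice)) This (seq This))
         (inc 'N (: id)) ) )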
Re: db new! performance
Hi Alex,

Thanks for the reply. Just for reference, using seq is actually
considerably slower: it ran in 39 seconds vs. 4 seconds. I think that's
because it has to look up every object from disk to get the value of
'id, instead of using the index, which is likely in memory. The index
appears to be stored as a simple list of external symbols and the index
value. I'm just guessing though.

Thanks,
Joe

On Wed, May 30, 2012 at 9:36 AM, Alexander Burger wrote:
> Hi Joe,
>
> > Thank you. That sped it up. It's taking 69 seconds to insert 1M records
> >
> >    (pool "foo.db")
> >    (class +Invoice +Entity)
> >    (rel id (+Key +Number))
> >    (zero N)
> >    (bench (do 1000000 (new (db: +Invoice) '(+Invoice) 'id (inc 'N)) ))
> >    (commit)
>
> You can further speed it up if you distribute objects and indices across
> separate files. For the above example:
>
>    (class +Invoice +Entity)
>    (rel id (+Key +Number))
>
>    (dbs
>       (3)                 # First file, 512 byte blocks
>       (2 +Invoice)        # Second file, 256 byte blocks
>       (4 (+Invoice id)) ) # Third file, 1024 byte blocks
>
> This puts the '+Invoice' objects into the second file (with a block size
> of 256), and the 'id' index into the third (with a block size of 1024).
>
> The first file (with a block size of 512) is not specified to hold any
> entities here, so it contains only the administrative data (root and
> base objects).
>
> Then you must pass a directory (instead of a file name) and the database
> size specifications to 'pool':
>
>    (pool "foo.db/" *Dbs)
>
> If you have really large indexes (more than, say, 10 or 100 million
> entries), then you might experiment with an even larger block size
> (e.g. 6, giving 4096 byte blocks). In my experience performance goes
> down again if you use too large a block size.
>
> > I can work with that. Now I am testing out queries.
> >
> >    ? (bench (iter (tree 'id '+Invoice) '((This) (inc 'Z (: id)))))
> >    11.822 sec
> >
> >    ? (bench (scan (tree 'id '+Invoice) '((Val Key) (inc 'Z Val))))
> >    4.430 sec
> >
> > It makes sense that scan would be fastest, because it can use the index
> > directly. Is that likely the fastest query to sum up a number relation?
>
> Yes, it is surely faster than a Pilog query (though less powerful).
>
> The absolute fastest, though, would be to use 'seq', i.e. to avoid
> using an index completely. This can be used occasionally, when (as in
> the above case) a file consists mainly of objects of a single type:
>
>    (bench
>       (for (This (seq (db: +Invoice)) This (seq This))
>          (inc 'Z (: id)) ) )
>
> If the file might also contain other objects, use this as the last line:
>
>    (and (isa '+Invoice This) (inc 'Z (: id)) )
>
> Cheers,
> - Alex
Re: db new! performance
Hi Joe,

> Thank you. That sped it up. It's taking 69 seconds to insert 1M records
>
>    (pool "foo.db")
>    (class +Invoice +Entity)
>    (rel id (+Key +Number))
>    (zero N)
>    (bench (do 1000000 (new (db: +Invoice) '(+Invoice) 'id (inc 'N)) ))
>    (commit)

You can further speed it up if you distribute objects and indices across
separate files. For the above example:

   (class +Invoice +Entity)
   (rel id (+Key +Number))

   (dbs
      (3)                 # First file, 512 byte blocks
      (2 +Invoice)        # Second file, 256 byte blocks
      (4 (+Invoice id)) ) # Third file, 1024 byte blocks

This puts the '+Invoice' objects into the second file (with a block size
of 256), and the 'id' index into the third (with a block size of 1024).

The first file (with a block size of 512) is not specified to hold any
entities here, so it contains only the administrative data (root and
base objects).

Then you must pass a directory (instead of a file name) and the database
size specifications to 'pool':

   (pool "foo.db/" *Dbs)

If you have really large indexes (more than, say, 10 or 100 million
entries), then you might experiment with an even larger block size
(e.g. 6, giving 4096 byte blocks). In my experience performance goes
down again if you use too large a block size.

> I can work with that. Now I am testing out queries.
>
>    ? (bench (iter (tree 'id '+Invoice) '((This) (inc 'Z (: id)))))
>    11.822 sec
>
>    ? (bench (scan (tree 'id '+Invoice) '((Val Key) (inc 'Z Val))))
>    4.430 sec
>
> It makes sense that scan would be fastest, because it can use the index
> directly. Is that likely the fastest query to sum up a number relation?

Yes, it is surely faster than a Pilog query (though less powerful).

The absolute fastest, though, would be to use 'seq', i.e. to avoid
using an index completely. This can be used occasionally, when (as in
the above case) a file consists mainly of objects of a single type:

   (bench
      (for (This (seq (db: +Invoice)) This (seq This))
         (inc 'Z (: id)) ) )

If the file might also contain other objects, use this as the last line:

   (and (isa '+Invoice This) (inc 'Z (: id)) )

Cheers,
- Alex
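As a side note, the scale numbers in the 'dbs' spec above appear to map
to block sizes as 64 * 2^N, which matches all four sizes Alex quotes
(2 -> 256, 3 -> 512, 4 -> 1024, 6 -> 4096). A quick sketch to check,
assuming that formula:

   (for N '(2 3 4 6)
      (prinl N " -> " (* 64 (** 2 N)) " bytes") )  # prints 256, 512, 1024 and 4096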
Re: db new! performance
Thank you. That sped it up. It's taking 69 seconds to insert 1M records

   (pool "foo.db")
   (class +Invoice +Entity)
   (rel id (+Key +Number))
   (zero N)
   (bench (do 1000000 (new (db: +Invoice) '(+Invoice) 'id (inc 'N)) ))
   (commit)

I can work with that. Now I am testing out queries.

   ? (bench (iter (tree 'id '+Invoice) '((This) (inc 'Z (: id)))))
   11.822 sec

   ? (bench (scan (tree 'id '+Invoice) '((Val Key) (inc 'Z Val))))
   4.430 sec

It makes sense that scan would be fastest, because it can use the index
directly. Is that likely the fastest query to sum up a number relation?

Thank you,
Joe

On Wed, May 30, 2012 at 2:10 AM, Alexander Burger wrote:
> On Wed, May 30, 2012 at 12:28:50PM +0700, Henrik Sarvell wrote:
> > Use new and chunk it up:
> >
> >    (dbSync)
> >    (for A As
> >       (at (0 . 1000) (commit 'upd) (prune) (dbSync))
> >       (new (db: +Article) '(+Article) key1 value1 key2 value2 ...) )
> >    (commit 'upd)
> >
> > With new! you are locking and writing every row, so it should only be
> > used in cases where you know you are inserting only one (or maybe very
> > few).
> >
> > Above we create them in memory and write 1000 of them at a time.
> >
> > If you have 12 million you should probably use an even higher number
> > than 1000.
>
> Yes, I usually use 10000. Larger values seem not to bring any further
> improvements, and use too much memory.
>
> You can slightly simplify and speed up the above if you do not need to
> synchronize with other processes during the import (i.e. if other
> processes can wait until the import is done). Then you can omit the
> calls to 'commit' with 'upd' (which signal "done" to other processes)
> and the (dbSync) calls in the loop.
>
> And, at the very end, you could call (prune T) to reset to normal
> behavior, though omitting this has no bad effect.
>
> With that, we would have
>
>    (dbSync)
>    (for ...
>       (new (db: +Article) '(+Article) 'key1 'key2 ...)
>       (at (0 . 10000) (commit) (prune)) )
>    (commit 'upd)
>    (prune T)
>
> Practically, I would also omit the 'prune' calls, and only insert them
> if I find that the import process uses too much memory (monitor with
> 'top' or 'ps'). This speeds up small imports (and 12 million objects
> still counts as small here).
>
> This would simplify the import to
>
>    (dbSync)
>    (for ...
>       (new (db: +Article) '(+Article) 'key1 'key2 ...)
>       (at (0 . 10000) (commit)) )
>    (commit 'upd)
>
> Cheers,
> - Alex
Re: db new! performance
On Wed, May 30, 2012 at 12:28:50PM +0700, Henrik Sarvell wrote:
> Use new and chunk it up:
>
>    (dbSync)
>    (for A As
>       (at (0 . 1000) (commit 'upd) (prune) (dbSync))
>       (new (db: +Article) '(+Article) key1 value1 key2 value2 ...) )
>    (commit 'upd)
>
> With new! you are locking and writing every row, so it should only be
> used in cases where you know you are inserting only one (or maybe very
> few).
>
> Above we create them in memory and write 1000 of them at a time.
>
> If you have 12 million you should probably use an even higher number
> than 1000.

Yes, I usually use 10000. Larger values seem not to bring any further
improvements, and use too much memory.

You can slightly simplify and speed up the above if you do not need to
synchronize with other processes during the import (i.e. if other
processes can wait until the import is done). Then you can omit the
calls to 'commit' with 'upd' (which signal "done" to other processes)
and the (dbSync) calls in the loop.

And, at the very end, you could call (prune T) to reset to normal
behavior, though omitting this has no bad effect.

With that, we would have

   (dbSync)
   (for ...
      (new (db: +Article) '(+Article) 'key1 'key2 ...)
      (at (0 . 10000) (commit) (prune)) )
   (commit 'upd)
   (prune T)

Practically, I would also omit the 'prune' calls, and only insert them
if I find that the import process uses too much memory (monitor with
'top' or 'ps'). This speeds up small imports (and 12 million objects
still counts as small here).

This would simplify the import to

   (dbSync)
   (for ...
      (new (db: +Article) '(+Article) 'key1 'key2 ...)
      (at (0 . 10000) (commit)) )
   (commit 'upd)

Cheers,
- Alex
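For the original use case (a flat file with 12+ million invoice rows),
the pieces above combine into a single import loop. A minimal sketch,
assuming a hypothetical tab-separated file "invoices.txt" with the
invoice number in its first column; the file name and column layout are
illustrative only:

   (pool "foo.db")
   (class +Invoice +Entity)
   (rel nr (+Key +Number))

   (dbSync)
   (in "invoices.txt"
      (while (line)  # each line is read as a list of characters
         (new (db: +Invoice) '(+Invoice)
            'nr (format (car (split @ "^I"))) )  # first tab-separated field, as a number
         (at (0 . 10000) (commit) (prune)) ) )   # flush every 10000 objects
   (commit 'upd)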
Re: db new! performance
It depends, of course. In the rare case that you actually need each row
to be securely on disk before writing the next one, the original
approach was correct. But flushing each row will take some time in SQL
databases too (Google for "Transactions Per Minute").

best,
Jakob

On May 30, 2012 at 7:28 AM Henrik Sarvell wrote:
> Use new and chunk it up:
[snip]
> Above we create them in memory and write 1000 of them at a time.
Re: db new! performance
Use new and chunk it up:

   (dbSync)
   (for A As
      (at (0 . 1000) (commit 'upd) (prune) (dbSync))
      (new (db: +Article) '(+Article) key1 value1 key2 value2 ...) )
   (commit 'upd)

With new! you are locking and writing every row, so it should only be
used in cases where you know you are inserting only one (or maybe very
few); see the sketch after this message for what each 'new!' call
involves.

Above we create them in memory and write 1000 of them at a time.

If you have 12 million you should probably use an even higher number
than 1000.

/Henrik

On Wed, May 30, 2012 at 10:36 AM, Joe Bogner wrote:
> I'm evaluating the use of PicoLisp for analyzing large datasets. Is it
> surprising that inserting a million rows into a simple db would take
> 5+ minutes on modern hardware? I killed it after about 500K rows were
> inserted (I checked by hitting Ctrl+C and inspecting N). It seems to
> get progressively slower after about 100K records.
>
>    (pool "foo.db")
>    (class +Invoice +Entity)
>    (rel nr (+Key +Number))
>    (zero N)
>    (do 1000000 (new! '(+Invoice) 'nr (inc 'N)))
>
> I'm just testing out the concept. My input data will be a flat file of
> invoice data (12 million rows+).
>
> Thanks,
> Joe
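For context on why per-row 'new!' is slow: each call roughly bundles a
database sync, the object creation, and a full commit. A minimal sketch
of the equivalent steps for a single row, using the '+Invoice' class
from the original post; this is an approximation for illustration, not
the actual library definition of 'new!':

   (dbSync)                                   # lock and synchronize with other processes
   (prog1
      (new (db: +Invoice) '(+Invoice) 'nr 1)  # create the object in memory
      (commit 'upd) )                         # write it to disk and notify others

Amortizing the sync and commit over thousands of objects, as in the
chunked loop above, is where the speedup comes from.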
db new! performance
I'm evaluating the use of PicoLisp for analyzing large datasets. Is it
surprising that inserting a million rows into a simple db would take 5+
minutes on modern hardware? I killed it after about 500K rows were
inserted (I checked by hitting Ctrl+C and inspecting N). It seems to
get progressively slower after about 100K records.

   (pool "foo.db")
   (class +Invoice +Entity)
   (rel nr (+Key +Number))
   (zero N)
   (do 1000000 (new! '(+Invoice) 'nr (inc 'N)))

I'm just testing out the concept. My input data will be a flat file of
invoice data (12 million rows+).

Thanks,
Joe