Re: db new! performance
Hi Joe,

> Thanks for the reply. Just for reference, using seq is actually
> considerably slower. It ran in 39 seconds vs. 4 seconds.

Yeah, tried it here too. It is only 9 seconds vs. 6 seconds, though.

> I think it's because it has to look up every object from disk to get
> the value of 'id instead of using the index, which is likely in memory.
> The index appears to be stored as a simple list of external symbols
> and the index value. I'm just guessing though.

Not really. While it is correct that the index holds values and symbols,
the value is not used here:

   (iter (tree 'id '+Invoice) '((This) (inc 'N (: id))))

That is, 'iter' iterates all nodes of the index tree, but then loads the
object into 'This', and (: id) fetches the value from there (your
previous example with 'scan' used only the index).

I think I know why

   (for (This (seq (db: +Invoice)) This (seq This))
      (inc 'N (: id)) )

is a bit slower. In principle, it has less to do, just stepping through
the objects with 'seq', which are also nicely in order in the file. But
internally it behaves differently: 'seq' is not designed for that
purpose, and doesn't reuse the existing symbols in the Lisp heap.

In general, I would not rely too tightly on isolated tests like the one
above, as the results are mostly determined by the system's disk cache
behavior. Always try to optimize against the production database model.

Cheers,
- Alex
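For reference, the three summation variants discussed in this thread,
side by side. A minimal sketch assuming the '+Invoice' schema from the
messages below; as noted above, the absolute timings are dominated by
the system's disk cache:

   (zero N)
   (bench  # 'scan': reads the keys directly from the index, no object loads
      (scan (tree 'id '+Invoice) '((Val Key) (inc 'N Val))) )

   (zero N)
   (bench  # 'iter': walks the index, but loads each object to read (: id)
      (iter (tree 'id '+Invoice) '((This) (inc 'N (: id)))) )

   (zero N)
   (bench  # 'seq': steps through the object file, ignoring the index
      (for (This (seq (db: +Invoice)) This (seq This))
         (inc 'N (: id)) ) )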
Re: db new! performance
Hi Alex,

Thanks for the reply. Just for reference, using seq is actually
considerably slower: it ran in 39 seconds vs. 4 seconds. I think that's
because it has to look up every object from disk to get the value of
'id, instead of using the index, which is likely in memory. The index
appears to be stored as a simple list of external symbols and the index
value. I'm just guessing though.

Thanks,
Joe

On Wed, May 30, 2012 at 9:36 AM, Alexander Burger wrote:
> Hi Joe,
>
> > Thank you. That sped it up. It's taking 69 seconds to insert 1M records
> >
> >    (pool "foo.db")
> >    (class +Invoice +Entity)
> >    (rel id (+Key +Number))
> >    (zero N)
> >    (bench (do 1000000 (new (db: +Invoice) '(+Invoice) 'id (inc 'N)) ))
> >    (commit)
>
> You can further speed it up if you distribute objects and indices across
> separate files. For the above example:
>
>    (class +Invoice +Entity)
>    (rel id (+Key +Number))
>
>    (dbs
>       (3)                 # First file, 512 byte blocks
>       (2 +Invoice)        # Second file, 256 byte blocks
>       (4 (+Invoice id)) ) # Third file, 1024 byte blocks
>
> This puts the '+Invoice' objects into the second file (with a block size
> of 256), and the 'id' index into the third (with a block size of 1024).
>
> The first file (with a block size of 512) is not specified to hold any
> entities here, so it contains only the administrative data (root and
> base objects).
>
> Then you must pass a directory (instead of a file name) and the database
> size specifications to 'pool':
>
>    (pool "foo.db/" *Dbs)
>
> If you have really large indexes (more than, say, 10 or 100 million
> entries), then you might experiment with an even larger block size
> (e.g. 6, giving 4096 byte blocks). In my experience performance goes
> down again if you use too large a block size.
>
> > I can work with that. Now I am testing out queries.
> >
> >    ? (bench (iter (tree 'id '+Invoice) '((This) (inc 'Z (: id)))))
> >    11.822 sec
> >
> >    ? (bench (scan (tree 'id '+Invoice) '((Val Key) (inc 'Z Val))))
> >    4.430 sec
> >
> > It makes sense that scan would be fastest, because it can use the index
> > directly. Is that likely the fastest query to sum up a number relation?
>
> Yes, it is surely faster than a Pilog query (though less powerful).
>
> The absolute fastest, though, would be to use 'seq', i.e. to avoid
> using an index completely. This can be used occasionally, when (as in
> the above case) a file consists mainly of objects of a single type:
>
>    (bench
>       (for (This (seq (db: +Invoice)) This (seq This))
>          (inc 'Z (: id)) ) )
>
> If the file might also contain other objects, use this as the last line:
>
>    (and (isa '+Invoice This) (inc 'Z (: id)) )
>
> Cheers,
> - Alex
Re: db new! performance
Hi Joe,

> Thank you. That sped it up. It's taking 69 seconds to insert 1M records
>
>    (pool "foo.db")
>    (class +Invoice +Entity)
>    (rel id (+Key +Number))
>    (zero N)
>    (bench (do 1000000 (new (db: +Invoice) '(+Invoice) 'id (inc 'N)) ))
>    (commit)

You can further speed it up if you distribute objects and indices across
separate files. For the above example:

   (class +Invoice +Entity)
   (rel id (+Key +Number))

   (dbs
      (3)                 # First file, 512 byte blocks
      (2 +Invoice)        # Second file, 256 byte blocks
      (4 (+Invoice id)) ) # Third file, 1024 byte blocks

This puts the '+Invoice' objects into the second file (with a block size
of 256), and the 'id' index into the third (with a block size of 1024).

The first file (with a block size of 512) is not specified to hold any
entities here, so it contains only the administrative data (root and
base objects).

Then you must pass a directory (instead of a file name) and the database
size specifications to 'pool':

   (pool "foo.db/" *Dbs)

If you have really large indexes (more than, say, 10 or 100 million
entries), then you might experiment with an even larger block size
(e.g. 6, giving 4096 byte blocks). In my experience performance goes
down again if you use too large a block size.

> I can work with that. Now I am testing out queries.
>
>    ? (bench (iter (tree 'id '+Invoice) '((This) (inc 'Z (: id)))))
>    11.822 sec
>
>    ? (bench (scan (tree 'id '+Invoice) '((Val Key) (inc 'Z Val))))
>    4.430 sec
>
> It makes sense that scan would be fastest, because it can use the index
> directly. Is that likely the fastest query to sum up a number relation?

Yes, it is surely faster than a Pilog query (though less powerful).

The absolute fastest, though, would be to use 'seq', i.e. to avoid
using an index completely. This can be used occasionally, when (as in
the above case) a file consists mainly of objects of a single type:

   (bench
      (for (This (seq (db: +Invoice)) This (seq This))
         (inc 'Z (: id)) ) )

If the file might also contain other objects, use this as the last line:

   (and (isa '+Invoice This) (inc 'Z (: id)) )

Cheers,
- Alex
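As a side note, the scale numbers in the 'dbs' spec above appear to map
to block sizes as 64 * 2^N, which matches all four sizes Alex quotes
(2 -> 256, 3 -> 512, 4 -> 1024, 6 -> 4096). A quick sketch to check,
assuming that formula:

   (for N '(2 3 4 6)
      (prinl N " -> " (* 64 (** 2 N)) " bytes") )  # prints 256, 512, 1024 and 4096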
Re: db new! performance
Thank you. That sped it up. It's taking 69 seconds to insert 1M records

   (pool "foo.db")
   (class +Invoice +Entity)
   (rel id (+Key +Number))
   (zero N)
   (bench (do 1000000 (new (db: +Invoice) '(+Invoice) 'id (inc 'N)) ))
   (commit)

I can work with that. Now I am testing out queries.

   ? (bench (iter (tree 'id '+Invoice) '((This) (inc 'Z (: id)))))
   11.822 sec

   ? (bench (scan (tree 'id '+Invoice) '((Val Key) (inc 'Z Val))))
   4.430 sec

It makes sense that scan would be fastest, because it can use the index
directly. Is that likely the fastest query to sum up a number relation?

Thank you,
Joe

On Wed, May 30, 2012 at 2:10 AM, Alexander Burger wrote:
> On Wed, May 30, 2012 at 12:28:50PM +0700, Henrik Sarvell wrote:
> > Use new and chunk it up:
> >
> >    (dbSync)
> >    (for A As
> >       (at (0 . 1000) (commit 'upd) (prune) (dbSync))
> >       (new (db: +Article) '(+Article) key1 value1 key2 value2 ...) )
> >    (commit 'upd)
> >
> > With new! you are locking and writing every row, so it should only be
> > used in cases where you know you are inserting only one (or maybe very
> > few).
> >
> > Above we create them in memory and write 1000 of them at a time.
> >
> > If you have 12 million you should probably use an even higher number
> > than 1000.
>
> Yes, I usually use 10000. Larger values seem not to bring any further
> improvements, and use too much memory.
>
> You can slightly simplify and speed up the above if you do not need to
> synchronize with other processes during the import (i.e. if other
> processes can wait until the import is done). Then you can omit the
> calls to 'commit' with 'upd' (which signal "done" to other processes)
> and the (dbSync) calls in the loop.
>
> And, at the very end, you could call (prune T) to reset to normal
> behavior, though omitting this has no bad effect.
>
> With that, we would have
>
>    (dbSync)
>    (for ...
>       (new (db: +Article) '(+Article) 'key1 'key2 ...)
>       (at (0 . 10000) (commit) (prune)) )
>    (commit 'upd)
>    (prune T)
>
> Practically, I would also omit the 'prune' calls, and only insert them
> if I find that the import process uses too much memory (monitor with
> 'top' or 'ps'). This speeds up small imports (and 12 million objects
> still counts as small here).
>
> This would simplify the import to
>
>    (dbSync)
>    (for ...
>       (new (db: +Article) '(+Article) 'key1 'key2 ...)
>       (at (0 . 10000) (commit)) )
>    (commit 'upd)
>
> Cheers,
> - Alex
Re: db new! performance
On Wed, May 30, 2012 at 12:28:50PM +0700, Henrik Sarvell wrote:
> Use new and chunk it up:
>
>    (dbSync)
>    (for A As
>       (at (0 . 1000) (commit 'upd) (prune) (dbSync))
>       (new (db: +Article) '(+Article) key1 value1 key2 value2 ...) )
>    (commit 'upd)
>
> With new! you are locking and writing every row, so it should only be
> used in cases where you know you are inserting only one (or maybe very
> few).
>
> Above we create them in memory and write 1000 of them at a time.
>
> If you have 12 million you should probably use an even higher number
> than 1000.

Yes, I usually use 10000. Larger values seem not to bring any further
improvements, and use too much memory.

You can slightly simplify and speed up the above if you do not need to
synchronize with other processes during the import (i.e. if other
processes can wait until the import is done). Then you can omit the
calls to 'commit' with 'upd' (which signal "done" to other processes)
and the (dbSync) calls in the loop.

And, at the very end, you could call (prune T) to reset to normal
behavior, though omitting this has no bad effect.

With that, we would have

   (dbSync)
   (for ...
      (new (db: +Article) '(+Article) 'key1 'key2 ...)
      (at (0 . 10000) (commit) (prune)) )
   (commit 'upd)
   (prune T)

Practically, I would also omit the 'prune' calls, and only insert them
if I find that the import process uses too much memory (monitor with
'top' or 'ps'). This speeds up small imports (and 12 million objects
still counts as small here).

This would simplify the import to

   (dbSync)
   (for ...
      (new (db: +Article) '(+Article) 'key1 'key2 ...)
      (at (0 . 10000) (commit)) )
   (commit 'upd)

Cheers,
- Alex
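For the original use case (a flat file with 12+ million invoice rows),
the pieces above combine into a single import loop. A minimal sketch,
assuming a hypothetical tab-separated file "invoices.txt" with the
invoice number in its first column; the file name and column layout are
illustrative only:

   (pool "foo.db")
   (class +Invoice +Entity)
   (rel nr (+Key +Number))

   (dbSync)
   (in "invoices.txt"
      (while (line)  # each line is read as a list of characters
         (new (db: +Invoice) '(+Invoice)
            'nr (format (car (split @ "^I"))) )  # first tab-separated field, as a number
         (at (0 . 10000) (commit) (prune)) ) )   # flush every 10000 objects
   (commit 'upd)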
Re: db new! performance
It depends, of course. In the rare case that you actually need each row
to be securely on disk before writing the next one, the original
approach was correct. But flushing each row will take some time in SQL
databases too (Google for "Transactions Per Minute").

best,
Jakob

On May 30, 2012 at 7:28 AM Henrik Sarvell wrote:
> Use new and chunk it up:
[snip]
> Above we create them in memory and write 1000 of them at a time.
Re: db new! performance
Use new and chunk it up:

   (dbSync)
   (for A As
      (at (0 . 1000) (commit 'upd) (prune) (dbSync))
      (new (db: +Article) '(+Article) key1 value1 key2 value2 ...) )
   (commit 'upd)

With new! you are locking and writing every row, so it should only be
used in cases where you know you are inserting only one (or maybe very
few); see the sketch after this message for what each 'new!' call
involves.

Above we create them in memory and write 1000 of them at a time.

If you have 12 million you should probably use an even higher number
than 1000.

/Henrik

On Wed, May 30, 2012 at 10:36 AM, Joe Bogner wrote:
> I'm evaluating the use of PicoLisp for analyzing large datasets. Is it
> surprising that inserting a million rows into a simple db would take
> 5+ minutes on modern hardware? I killed it after about 500K rows were
> inserted (I checked by hitting Ctrl+C and inspecting N). It seems to
> get progressively slower after about 100K records.
>
>    (pool "foo.db")
>    (class +Invoice +Entity)
>    (rel nr (+Key +Number))
>    (zero N)
>    (do 1000000 (new! '(+Invoice) 'nr (inc 'N)))
>
> I'm just testing out the concept. My input data will be a flat file of
> invoice data (12 million rows+).
>
> Thanks,
> Joe
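For context on why per-row 'new!' is slow: each call roughly bundles a
database sync, the object creation, and a full commit. A minimal sketch
of the equivalent steps for a single row, using the '+Invoice' class
from the original post; this is an approximation for illustration, not
the actual library definition of 'new!':

   (dbSync)                                   # lock and synchronize with other processes
   (prog1
      (new (db: +Invoice) '(+Invoice) 'nr 1)  # create the object in memory
      (commit 'upd) )                         # write it to disk and notify others

Amortizing the sync and commit over thousands of objects, as in the
chunked loop above, is where the speedup comes from.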
db new! performance
I'm evaluating the use of PicoLisp for analyzing large datasets. Is it
surprising that inserting a million rows into a simple db would take 5+
minutes on modern hardware? I killed it after about 500K rows were
inserted (I checked by hitting Ctrl+C and inspecting N). It seems to
get progressively slower after about 100K records.

   (pool "foo.db")
   (class +Invoice +Entity)
   (rel nr (+Key +Number))
   (zero N)
   (do 1000000 (new! '(+Invoice) 'nr (inc 'N)))

I'm just testing out the concept. My input data will be a flat file of
invoice data (12 million rows+).

Thanks,
Joe