Re: Creation stamp
Hi Joe,

I seemed to get the best performance doing a commit + prune every day. I tried every 7 days first (see at) but from my experience the prune needs to happen more often.

On Mon, Feb 10, 2014 at 6:37 PM, Joe Bogner joebog...@gmail.com wrote:

> Henrik - Thank you for posting the code. I enjoyed tinkering around
> with it. The inserts took a long time -- I stopped after about 30
> minutes and then added some timing info. I think it was taking about
> 20 seconds per day, and that time will grow if I recall correctly. I
> am guessing it would take 2-3 hours to insert the 31M rows (on SSD and
> a Xen environment) and a fair amount of disk space. I think I was up
> to about 2 GB with 50 days.
>
> I may look further into experimenting with different block sizes:
> https://www.mail-archive.com/picolisp@software-lab.de/msg03304.html
>
> If you end up speeding it up, please share. I know it's just a mock
> example, so it may not be worth the time. It's nice to have small
> reproducible examples. It's neat to hear that the queries are
> sub-second.
>
> Thanks
> Joe
>
> On Sun, Feb 9, 2014 at 6:24 AM, Henrik Sarvell hsarv...@gmail.com wrote:
>
>> Yes, a bit perhaps. I tested, and it is of no consequence (at least
>> for my applications): given one transaction per second for a full
>> year, fetching a random +Ref +String day takes a fraction of a second
>> on my PC equipped with an SSD. Here is the code. Note that it's only
>> the collect at the end that takes a fraction of a second; the
>> insertions do NOT.
>>
>> (class +Transaction +Entity)
>> (rel amount (+Number))
>> (rel createdAt (+Ref +String))
>>
>> (dbs
>>    (4 +Transaction)
>>    (4 (+Transaction createdAt)) )
>>
>> (pool "/opt/picolisp/projects/test/db/db" *Dbs)
>>
>> (setq Sday (date 2013 01 01))
>> (setq Eday (+ Sday 364))
>> (setq F (db: +Transaction))
>>
>> (for (D Sday (>= Eday D) (inc D))
>>    (for (S 1 (>= 86400 S) (inc S))
>>       (let Stamp (stamp D S)
>>          (println Stamp)
>>          (new F '(+Transaction) 'amount 100 'createdAt Stamp) ) )
>>    (commit)
>>    (prune) )
>>
>> (commit)
>> (prune T)
>>
>> (println (collect 'createdAt '+Transaction "2013-10-05 00:00:00" "2013-10-05 23:59:59"))
>> (bye)
>>
>> On Sat, Feb 8, 2014 at 5:44 PM, Alexander Burger a...@software-lab.de wrote:
>>
>>> Hi Henrik,
>>>
>>> On Fri, Feb 07, 2014 at 08:29:07PM +0700, Henrik Sarvell wrote:
>>>
>>>> Given a very large amount of external objects, representing for
>>>> instance transactions, what would the quickest way of handling the
>>>> creation stamp be with regards to future lookups by way of start
>>>> stamp and end stamp? It seems to me that using two relations might
>>>> be optimal, one +Ref +Date and an extra +Ref +Time. Then a lookup
>>>> could first use the +Date relation to filter out all transactions
>>>> that weren't created during the specified days, followed
>>>> (optionally) by a filter on +Time.
>>>
>>> You could use two separate relations, but then I would definitely
>>> combine them with '+Aux'
>>>
>>>    (rel d (+Aux +Ref +Date) (t))  # Date
>>>    (rel t (+Time))                # Time
>>>
>>> In this way a single B-Tree access is sufficient to find any time range.
Re: Creation stamp
Hi Joe + Henrik,

On Mon, Feb 10, 2014 at 06:37:34AM -0500, Joe Bogner wrote:

> If you end up speeding it up, please share. I know it's just a mock
> example, so it may not be worth the time. It's nice to have small
> reproducible examples.

Oops! I just noticed that the 'prune' semantics Henrik uses is outdated. I'm not sure which PicoLisp version you use, but 'prune' changed last December (with 3.1.4.13) ... Sorry, I should have posted a note about this :(

If you have a more recent version, you should call 'prune' with a count during the import operation, and just (prune) (i.e. (prune NIL)) to disable pruning. Otherwise, pruning is not enabled at all, and your process keeps growing and growing ...

With that, the example becomes

   (for (D Sday (>= Eday D) (inc D))
      (for (S 1 (>= 86400 S) (inc S))
         ... )
      (commit)
      (prune 10) )
   (commit)
   (prune)

Also, you can save quite some time if you pre-allocate memory, to avoid an increase with each garbage collection. I would call (gc 800) in the beginning, to allocate 800 MB, and (gc 0) in the end. This gives:

   (gc 800 100)
   (for (D Sday (>= Eday D) (inc D))
      (for (S 1 (>= 86400 S) (inc S))
         (let Stamp (stamp D S)
            ## (println Stamp)
            (new F '(+Transaction) 'amount 100 'createdAt Stamp) ) )
      (commit)
      (prune 10) )
   (commit)
   (prune)
   (gc 0)

I tested with 10% of the data, and got a speed increase by a factor of seventeen (2493 sec vs. 145 sec, on a notebook with an HD, no SSD).

> It's neat to hear that the queries are sub-second.

Yes, but I still feel uneasy about storing time and date as a string in the database. Besides being inconvenient for date calculations, a string like "2013-10-05 00:00:00" takes up 20 bytes both in the entity object and in the index tree, while a date and a time like (735580 53948) take only 10 bytes. In the experiments above, the calls to (stamp) took 21.4 seconds in total. That's 1/7th of the total import time. In addition, because of the smaller key size, you get more index entries into a disk block, further increasing import speed.
♪♫ Alex -- UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe
Re: Creation stamp
So by (735580 53948) you mean a +Ref +List? Is it possible to get a range by way of collect with that setup?

I tested with two separate relations, i.e. one +Ref +Time and one +Ref +Date; the database file ended up the same size.

On Mon, Feb 10, 2014 at 9:31 PM, Alexander Burger a...@software-lab.de wrote:

> Hi Joe + Henrik,
>
> On Mon, Feb 10, 2014 at 06:37:34AM -0500, Joe Bogner wrote:
>
>> If you end up speeding it up, please share. I know it's just a mock
>> example, so it may not be worth the time. It's nice to have small
>> reproducible examples.
>
> Oops! I just noticed that the 'prune' semantics Henrik uses is
> outdated. I'm not sure which PicoLisp version you use, but 'prune'
> changed last December (with 3.1.4.13) ... Sorry, I should have posted
> a note about this :(
>
> If you have a more recent version, you should call 'prune' with a
> count during the import operation, and just (prune) (i.e. (prune NIL))
> to disable pruning. Otherwise, pruning is not enabled at all, and your
> process keeps growing and growing ...
>
> With that, the example becomes
>
>    (for (D Sday (>= Eday D) (inc D))
>       (for (S 1 (>= 86400 S) (inc S))
>          ... )
>       (commit)
>       (prune 10) )
>    (commit)
>    (prune)
>
> Also, you can save quite some time if you pre-allocate memory, to
> avoid an increase with each garbage collection. I would call (gc 800)
> in the beginning, to allocate 800 MB, and (gc 0) in the end. This
> gives:
>
>    (gc 800 100)
>    (for (D Sday (>= Eday D) (inc D))
>       (for (S 1 (>= 86400 S) (inc S))
>          (let Stamp (stamp D S)
>             ## (println Stamp)
>             (new F '(+Transaction) 'amount 100 'createdAt Stamp) ) )
>       (commit)
>       (prune 10) )
>    (commit)
>    (prune)
>    (gc 0)
>
> I tested with 10% of the data, and got a speed increase by a factor of
> seventeen (2493 sec vs. 145 sec, on a notebook with an HD, no SSD).
>
>> It's neat to hear that the queries are sub-second.
>
> Yes, but I still feel uneasy about storing time and date as a string
> in the database. Besides being inconvenient for date calculations, a
> string like "2013-10-05 00:00:00" takes up 20 bytes both in the entity
> object and in the index tree, while a date and a time like
> (735580 53948) take only 10 bytes. In the experiments above, the calls
> to (stamp) took 21.4 seconds in total. That's 1/7th of the total
> import time. In addition, because of the smaller key size, you get
> more index entries into a disk block, further increasing import speed.
>
> ♪♫ Alex
Re: Creation stamp
On Mon, Feb 10, 2014 at 09:57:13PM +0700, Henrik Sarvell wrote:

> So by (735580 53948) you mean a +Ref +List? Is it possible to get a
> range by way of collect with that setup?

No, not a +Ref +List. As I proposed on Feb 08

   (rel d (+Aux +Ref +Date) (t))  # Date
   (rel t (+Time))                # Time

or

   (rel ts (+Ref +Bag) ((+Date)) ((+Time)))  # Timestamp

In both cases, the index key will be (date time). You can use (collect ..) or Pilog queries.

> I tested with two separate relations, i.e. one +Ref +Time and one
> +Ref +Date; the database file ended up the same size.

The file size for the entities themselves will probably stay the same, as usually the block size is sufficiently big. But the index tree should need fewer blocks (as more entries go into each block). The PicoLisp B-Trees don't use a fixed number of entries per block, but fill up each block until it is full and thus needs to be split.

♪♫ Alex
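[Editor's note: as an untested sketch of how the +Bag timestamp Alex describes could be defined and range-queried with (collect); the +Event class and field name are hypothetical, not from the thread:]

```picolisp
## Hypothetical entity with a combined (date time) timestamp index
(class +Event +Entity)
(rel ts (+Ref +Bag) ((+Date)) ((+Time)))  # Timestamp as (date time)

## Range query over the single B-Tree: all events on 2013-10-05,
## from midnight (time 0) to 23:59:59 (time 86399)
(collect 'ts '+Event
   (list (date 2013 10 5) 0)
   (list (date 2013 10 5) 86399) )
```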
Re: Creation stamp
Hey Alex -

On Mon, Feb 10, 2014 at 9:31 AM, Alexander Burger a...@software-lab.de wrote:

> Also, you can save quite some time if you pre-allocate memory, to
> avoid an increase with each garbage collection. I would call (gc 800)
> in the beginning, to allocate 800 MB, and (gc 0) in the end.

Thanks for the reminder about gc. I remember you mentioning it over a year ago: https://www.mail-archive.com/picolisp@software-lab.de/msg03308.html. I added the gc and completed 30 days of import in two minutes. I also switched to my i7 (under Cygwin, too) vs my Xen virtual host. It ended up using 2.7 GB of disk, so I had to stop it.

Again, I'm reminded of and impressed with the speed. Thanks

Joe
Re: Creation stamp
The index file is 1.3 GB in the +Bag case and 2 GB in the +String case; that doesn't seem like a big deal to me, given that the main entity file ends up being 32 GB. Now I haven't checked, but due to the relative sizes of the files the range query might be comparably faster; in my case, though, a tenth of a second here and there won't matter.

On Tue, Feb 11, 2014 at 2:54 AM, Joe Bogner joebog...@gmail.com wrote:

> Hey Alex -
>
> On Mon, Feb 10, 2014 at 9:31 AM, Alexander Burger a...@software-lab.de wrote:
>
>> Also, you can save quite some time if you pre-allocate memory, to
>> avoid an increase with each garbage collection. I would call (gc 800)
>> in the beginning, to allocate 800 MB, and (gc 0) in the end.
>
> Thanks for the reminder about gc. I remember you mentioning it over a
> year ago:
> https://www.mail-archive.com/picolisp@software-lab.de/msg03308.html.
> I added the gc and completed 30 days of import in two minutes. I also
> switched to my i7 (under Cygwin, too) vs my Xen virtual host. It ended
> up using 2.7 GB of disk, so I had to stop it.
>
> Again, I'm reminded of and impressed with the speed. Thanks
>
> Joe
Re: Creation stamp
Yes, a bit perhaps. I tested, and it is of no consequence (at least for my applications): given one transaction per second for a full year, fetching a random +Ref +String day takes a fraction of a second on my PC equipped with an SSD. Here is the code. Note that it's only the collect at the end that takes a fraction of a second; the insertions do NOT.

(class +Transaction +Entity)
(rel amount (+Number))
(rel createdAt (+Ref +String))

(dbs
   (4 +Transaction)
   (4 (+Transaction createdAt)) )

(pool "/opt/picolisp/projects/test/db/db" *Dbs)

(setq Sday (date 2013 01 01))
(setq Eday (+ Sday 364))
(setq F (db: +Transaction))

(for (D Sday (>= Eday D) (inc D))
   (for (S 1 (>= 86400 S) (inc S))
      (let Stamp (stamp D S)
         (println Stamp)
         (new F '(+Transaction) 'amount 100 'createdAt Stamp) ) )
   (commit)
   (prune) )

(commit)
(prune T)

(println (collect 'createdAt '+Transaction "2013-10-05 00:00:00" "2013-10-05 23:59:59"))
(bye)

On Sat, Feb 8, 2014 at 5:44 PM, Alexander Burger a...@software-lab.de wrote:

> Hi Henrik,
>
> On Fri, Feb 07, 2014 at 08:29:07PM +0700, Henrik Sarvell wrote:
>
>> Given a very large amount of external objects, representing for
>> instance transactions, what would the quickest way of handling the
>> creation stamp be with regards to future lookups by way of start
>> stamp and end stamp? It seems to me that using two relations might be
>> optimal, one +Ref +Date and an extra +Ref +Time. Then a lookup could
>> first use the +Date relation to filter out all transactions that
>> weren't created during the specified days, followed (optionally) by a
>> filter on +Time.
>
> You could use two separate relations, but then I would definitely
> combine them with '+Aux'
>
>    (rel d (+Aux +Ref +Date) (t))  # Date
>    (rel t (+Time))                # Time
>
> In this way a single B-Tree access is sufficient to find any time
> range. For example, to find all entities between today noon and
> tomorrow noon:
>
>    (collect 'd '+Mup
>       (list (date) (time 12 0 0))
>       (list (inc (date)) (time 11 59 59)) )
>
> Another possibility is using not two separate relations, but a single
> bag relation
>
>    (rel ts (+Ref +Bag) ((+Date)) ((+Time)))  # Timestamp
>
> This saves a little space in the objects, but results in the same
> index entry format. But anyway, in both cases a single index tree is
> used.
>
> In the first case you also have the option to define the time as
>
>    (rel t (+Ref +Time))  # Time
>
> with an additional separate index, so that you can also search for
> certain time ranges only (no matter what the date is).
>
>> Or am I over-thinking it, is a simple +Ref +Number with a UNIX
>> timestamp an easier approach that is just as fast?
>
> I think this would not make any difference in speed (regarding index
> access), but would have some disadvantages, like having to convert
> this format to/from PicoLisp date and time values, and being limited
> in range (the Unix timestamp cannot represent dates before 1970).
>
>> A +Ref +String storing the result of a call to stamp would be ideal
>> as the information is human readable without conversions. However, I
>> suspect that a start-end lookup on it would be much slower than the
>> above, or?
>
> Yes, a bit perhaps. Parsing and printing human-readable date and time
> values is simple in PicoLisp (e.g. with 'date', 'stamp', 'datStr' and
> related functions, see http://software-lab.de/doc/refD.html#date).
>
> ♪♫ Alex
Re: Creation stamp
Hi Henrik,

On Fri, Feb 07, 2014 at 08:29:07PM +0700, Henrik Sarvell wrote:

> Given a very large amount of external objects, representing for
> instance transactions, what would the quickest way of handling the
> creation stamp be with regards to future lookups by way of start stamp
> and end stamp? It seems to me that using two relations might be
> optimal, one +Ref +Date and an extra +Ref +Time. Then a lookup could
> first use the +Date relation to filter out all transactions that
> weren't created during the specified days, followed (optionally) by a
> filter on +Time.

You could use two separate relations, but then I would definitely combine them with '+Aux'

   (rel d (+Aux +Ref +Date) (t))  # Date
   (rel t (+Time))                # Time

In this way a single B-Tree access is sufficient to find any time range. For example, to find all entities between today noon and tomorrow noon:

   (collect 'd '+Mup
      (list (date) (time 12 0 0))
      (list (inc (date)) (time 11 59 59)) )

Another possibility is using not two separate relations, but a single bag relation

   (rel ts (+Ref +Bag) ((+Date)) ((+Time)))  # Timestamp

This saves a little space in the objects, but results in the same index entry format. But anyway, in both cases a single index tree is used.

In the first case you also have the option to define the time as

   (rel t (+Ref +Time))  # Time

with an additional separate index, so that you can also search for certain time ranges only (no matter what the date is).

> Or am I over-thinking it, is a simple +Ref +Number with a UNIX
> timestamp an easier approach that is just as fast?

I think this would not make any difference in speed (regarding index access), but would have some disadvantages, like having to convert this format to/from PicoLisp date and time values, and being limited in range (the Unix timestamp cannot represent dates before 1970).

> A +Ref +String storing the result of a call to stamp would be ideal as
> the information is human readable without conversions. However, I
> suspect that a start-end lookup on it would be much slower than the
> above, or?

Yes, a bit perhaps. Parsing and printing human-readable date and time values is simple in PicoLisp (e.g. with 'date', 'stamp', 'datStr' and related functions, see http://software-lab.de/doc/refD.html#date).

♪♫ Alex
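[Editor's note: a rough, untested sketch of the conversion functions Alex names ('date', 'time', 'stamp', 'datStr'); see the linked refD documentation for exact signatures and locale behaviour:]

```picolisp
## Internal representation: a date is a day number, a time is
## seconds since midnight
(setq D (date 2013 10 5))   # day number for 2013-10-05
(setq T (time 14 30 0))     # 52200 seconds for 14:30:00

## Human-readable forms, without storing strings in the DB
(stamp D T)    # formats as "2013-10-05 14:30:00"
(datStr D)     # formats the date part only
```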
Creation stamp
Given a very large amount of external objects, representing for instance transactions, what would the quickest way of handling the creation stamp be with regards to future lookups by way of start stamp and end stamp?

It seems to me that using two relations might be optimal, one +Ref +Date and an extra +Ref +Time. Then a lookup could first use the +Date relation to filter out all transactions that weren't created during the specified days, followed (optionally) by a filter on +Time.

Or am I over-thinking it: is a simple +Ref +Number with a UNIX timestamp an easier approach that is just as fast?

A +Ref +String storing the result of a call to stamp would be ideal, as the information is human-readable without conversions. However, I suspect that a start-end lookup on it would be much slower than the above, or?