Re: Creation stamp

2014-02-10 Thread Henrik Sarvell
Hi Joe, I seem to get the best performance doing a commit + prune every
day. I tried every 7 days first (see at), but from my experience the prune
needs to happen more often.


On Mon, Feb 10, 2014 at 6:37 PM, Joe Bogner joebog...@gmail.com wrote:

 Henrik - Thank you for posting the code. I enjoyed tinkering around with
 it. The inserts took a long time; I stopped after about 30 minutes and
 then added some timing info. I think it was taking about 20 seconds per
 day, and that time will grow if I recall correctly. I am guessing it
 would take 2-3 hours to insert the 31M rows (on SSD in a Xen environment)
 and a fair amount of disk space. I think I was up to about 2 GB with 50
 days.

 I may look further into experimenting with different block sizes:
 https://www.mail-archive.com/picolisp@software-lab.de/msg03304.html

 If you end up speeding it up, please share. I know it's just a mock
 example, so it may not be worth the time. It's nice to have small
 reproducible examples.

 It's neat to hear that the queries are sub second.

 Thanks
 Joe




 On Sun, Feb 9, 2014 at 6:24 AM, Henrik Sarvell hsarv...@gmail.com wrote:

 Yes, a bit perhaps.

 I tested, and it is of no consequence (at least for my applications):
 given one transaction per second for a full year, fetching a random
 +Ref +String day takes a fraction of a second on my SSD-equipped PC.
 Here is the code:

 Note that it's only the collect at the end that takes a fraction of a
 second; the insertions do NOT.

 (class +Transaction +Entity)
 (rel amount (+Number))
 (rel createdAt (+Ref +String))

 (dbs
    (4 +Transaction)
    (4 (+Transaction createdAt)) )

 (pool "/opt/picolisp/projects/test/db/db" *Dbs)

 (setq Sday (date 2013 01 01))
 (setq Eday (+ Sday 364))
 (setq F (db: +Transaction))

 (for (D Sday (>= Eday D) (inc D))
    (for (S 1 (>= 86400 S) (inc S))
       (let Stamp (stamp D S)
          (println Stamp)
          (new F '(+Transaction) 'amount 100 'createdAt Stamp) ) )
    (commit)
    (prune) )

 (commit)
 (prune T)

 (println (collect 'createdAt '+Transaction "2013-10-05 00:00:00"
    "2013-10-05 23:59:59"))

 (bye)




 On Sat, Feb 8, 2014 at 5:44 PM, Alexander Burger a...@software-lab.de wrote:

 Hi Henrik,

 On Fri, Feb 07, 2014 at 08:29:07PM +0700, Henrik Sarvell wrote:
  Given a very large amount of external objects, representing for instance
  transactions, what would be the quickest way of handling the creation
  stamp with regard to future lookups by way of start and end stamps?
 
  It seems to me that using two relations might be optimal, one +Ref +Date
  and an extra +Ref +Time. Then a lookup could first use the +Date relation
  to filter out all transactions that weren't created during the specified
  days, followed by (optionally) a filter by +Time.

 You could use two separate relations, but then I would definitely
 combine them with '+Aux'

(rel d (+Aux +Ref +Date) (t)) # Date
(rel t (+Time))   # Time

 In this way a single B-Tree access is sufficient to find any time range

Re: Creation stamp

2014-02-10 Thread Alexander Burger
Hi Joe + Henrik,

On Mon, Feb 10, 2014 at 06:37:34AM -0500, Joe Bogner wrote:
 If you end up speeding up please share. I know it's just a mock example so
 may not be worth the time. It's nice to have small reproducible examples.

Oops! I just noticed that the 'prune' semantics Henrik uses are outdated.

I'm not sure which PicoLisp version you use, but 'prune' changed last
December (with 3.1.4.13) ... Sorry, I should have posted a note about
this :(

If you have a more recent version, you should call 'prune' with a count
during the import operation, and just (prune) (i.e. (prune NIL)) at the
end to disable pruning. Otherwise, pruning is not enabled at all, and your
process keeps growing and growing ...

With that, the example becomes

   (for (D Sday (>= Eday D) (inc D))
      (for (S 1 (>= 86400 S) (inc S))
         ... )
      (commit)
      (prune 10) )
   (commit)
   (prune)


Also, you can save quite some time if you pre-allocate memory, to avoid
an increase with each garbage collection. I would call (gc 800) in the
beginning, to allocate 800 MB, and (gc 0) in the end. This gives:

   (gc 800 100)
   (for (D Sday (>= Eday D) (inc D))
      (for (S 1 (>= 86400 S) (inc S))
         (let Stamp (stamp D S)
            ## (println Stamp)
            (new F '(+Transaction) 'amount 100 'createdAt Stamp) ) )
      (commit)
      (prune 10) )
   (commit)
   (prune)
   (gc 0)

I tested with 10% of the data, and got a speed increase by a factor of
seventeen (2493 sec vs. 145 sec on a notebook with HD, no SSD).
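The commit-per-batch pattern above, with the garbage collector kept out of the hot loop, carries over to other runtimes as well. Here is a rough Python sketch of the same shape, using sqlite3 as a stand-in for the PicoLisp database (the table layout, row counts, and `bulk_import` helper are illustrative assumptions, not part of the original example):

```python
import gc
import sqlite3

def bulk_import(days=3, per_day=1000):
    """Insert rows in one transaction per 'day', mirroring the
    commit/prune batching in the example above."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE tx (amount INTEGER, created_at TEXT)")
    gc.disable()  # keep GC churn out of the hot loop, akin to (gc 800)
    try:
        for d in range(days):
            rows = ((100, "2013-01-%02d %02d:%02d:%02d"
                          % (d + 1, s // 3600, s // 60 % 60, s % 60))
                    for s in range(per_day))
            con.executemany("INSERT INTO tx VALUES (?, ?)", rows)
            con.commit()  # one commit per batch, as in the example
    finally:
        gc.enable()   # back to normal, akin to (gc 0)
    return con.execute("SELECT COUNT(*) FROM tx").fetchone()[0]

print(bulk_import())  # 3000
```

The point is the shape, not the numbers: batching commits amortizes the per-transaction cost, and deferring collection avoids repeated heap growth during the import.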



 It's neat to hear that the queries are sub second.

Yes, but I still feel uneasy about storing date and time as a string in
the database. Besides being inconvenient for date calculations, a string
like

   2013-10-05 00:00:00

takes up 20 bytes both in the entity object and in the index tree, while
a date and a time like

   (735580 53948)

takes only 10 bytes. In the experiments above, the calls to (stamp) took
21.4 seconds in total. That's 1/7th of the total import time. In
addition, because of the smaller key size, you get more index entries
into a disk block, further increasing import speed.
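The reason a (date time) number pair can replace the string key without losing range queries is that the two sort identically. A small Python sketch of that equivalence (the `as_pair` helper is a hypothetical analogue of PicoLisp's date/time encoding, not its exact numbers):

```python
from datetime import datetime

def as_pair(stamp):
    """Hypothetical analogue of PicoLisp's (date time) pair:
    (days since 0001-01-01, seconds into the day)."""
    dt = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return (dt.date().toordinal(),
            dt.hour * 3600 + dt.minute * 60 + dt.second)

stamps = ["2013-10-05 23:59:59", "2013-10-05 00:00:00",
          "2014-01-01 12:00:00"]

# The compact pair keys sort in exactly the same order as the strings,
# so the same B-Tree range lookups keep working on half-size keys.
assert sorted(stamps) == sorted(stamps, key=as_pair)
print(as_pair("2013-10-05 00:00:00"))
```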

♪♫ Alex
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: Creation stamp

2014-02-10 Thread Henrik Sarvell
So by (735580 53948) you mean a +Ref +List? Is it possible to get a range
by way of collect with that setup?

I tested with two separate relations, i.e. one +Ref +Time and one +Ref
+Date; the database file ended up the same size.


On Mon, Feb 10, 2014 at 9:31 PM, Alexander Burger a...@software-lab.de wrote:

 Hi Joe + Henrik,

 On Mon, Feb 10, 2014 at 06:37:34AM -0500, Joe Bogner wrote:
  If you end up speeding up please share. I know it's just a mock example
 so
  may not be worth the time. It's nice to have small reproducible examples.

 Oops! I just noticed that the 'prune' semantics Henrik uses are outdated.

 I'm not sure which PicoLisp version you use, but 'prune' changed last
 December (with 3.1.4.13) ... Sorry, I should have posted a note about
 this :(

 If you have a more recent version, you should call 'prune' with a count
 during the import operation, and just (prune) (i.e. (prune NIL)) at the
 end to disable pruning. Otherwise, pruning is not enabled at all, and
 your process keeps growing and growing ...

 With that, the example becomes

    (for (D Sday (>= Eday D) (inc D))
       (for (S 1 (>= 86400 S) (inc S))
          ... )
       (commit)
       (prune 10) )
    (commit)
    (prune)


 Also, you can save quite some time if you pre-allocate memory, to avoid
 an increase with each garbage collection. I would call (gc 800) in the
 beginning, to allocate 800 MB, and (gc 0) in the end. This gives:

    (gc 800 100)
    (for (D Sday (>= Eday D) (inc D))
       (for (S 1 (>= 86400 S) (inc S))
          (let Stamp (stamp D S)
             ## (println Stamp)
             (new F '(+Transaction) 'amount 100 'createdAt Stamp) ) )
       (commit)
       (prune 10) )
    (commit)
    (prune)
    (gc 0)

 I tested with 10% of the data, and got a speed increase by a factor of
 seventeen (2493 sec vs. 145 sec on a notebook with HD, no SSD).



  It's neat to hear that the queries are sub second.

 Yes, but I still feel uneasy about storing date and time as a string in
 the database. Besides being inconvenient for date calculations, a string
 like

2013-10-05 00:00:00

 takes up 20 bytes both in the entity object and in the index tree, while
 a date and a time like

(735580 53948)

 takes only 10 bytes. In the experiments above, the calls to (stamp) took
 21.4 seconds in total. That's 1/7th of the total import time. In
 addition, because of the smaller key size, you get more index entries
 into a disk block, further increasing import speed.

 ♪♫ Alex



Re: Creation stamp

2014-02-10 Thread Alexander Burger
On Mon, Feb 10, 2014 at 09:57:13PM +0700, Henrik Sarvell wrote:
 So by (735580 53948) you mean a +Ref +List? Is it possible to get a range
 by way of collect with that setup?

No, not a +Ref +List. As I proposed on Feb 08

   (rel d (+Aux +Ref +Date) (t)) # Date
   (rel t (+Time))   # Time

or

   (rel ts (+Ref +Bag) ((+Date)) ((+Time)))  # Timestamp

In both cases, the index key will be (date time). You can use
(collect ..) or Pilog queries.


 I tested with two separate relations, ie one +Ref +Time and one +Ref +Date,
 the database file ended up the same size.

The file size for the entities themselves will probably stay the same,
as usually the block size is sufficiently big.

But the index tree should need fewer blocks (as more entries go into
each block). The PicoLisp B-Trees don't use a fixed number of entries
per block, but fill up each block until it is full and thus needs to be
split.
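The block-fill effect can be put into rough numbers. This back-of-envelope Python sketch assumes a block size and per-entry overhead that are purely illustrative (PicoLisp's actual on-disk layout differs):

```python
def entries_per_block(block_size, key_size, overhead=8):
    """How many index entries fit in one block, given a key size plus
    some fixed per-entry bookkeeping (both figures assumed)."""
    return block_size // (key_size + overhead)

string_keys = entries_per_block(4096, 20)  # "2013-10-05 00:00:00"
pair_keys = entries_per_block(4096, 10)    # (735580 53948)

# Smaller keys -> more entries per block -> fewer blocks and splits.
print(string_keys, pair_keys)  # 146 227
assert pair_keys > string_keys
```

Whatever the real constants are, halving the key size roughly doubles the entries per block, which is why the compact key speeds up both import and lookup.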

♪♫ Alex


Re: Creation stamp

2014-02-10 Thread Joe Bogner
Hey Alex -

On Mon, Feb 10, 2014 at 9:31 AM, Alexander Burger a...@software-lab.de wrote:

 Also, you can save quite some time if you pre-allocate memory, to avoid
 an increase with each garbage collection. I would call (gc 800) in the
 beginning, to allocate 800 MB, and (gc 0) in the end.



Thanks for the reminder about gc; I remember you mentioning it over a year
ago: https://www.mail-archive.com/picolisp@software-lab.de/msg03308.html. I
added the gc call and completed 30 days of import in two minutes. I also
switched to my i7 (under Cygwin, too) vs. my Xen virtual host. It ended up
using 2.7 GB of disk, so I had to stop it. Again, I'm reminded of and
impressed by the speed.

Thanks
Joe


Re: Creation stamp

2014-02-10 Thread Henrik Sarvell
The index file is 1.3 GB in the +Bag case and 2 GB in the +String case;
that doesn't seem like a big deal to me, given that the main entity file
ends up being 32 GB.

I haven't checked, but due to the relative sizes of the files the range
query might be comparably faster; in my case, though, a tenth of a second
here and there won't matter.


On Tue, Feb 11, 2014 at 2:54 AM, Joe Bogner joebog...@gmail.com wrote:

 Hey Alex -

 On Mon, Feb 10, 2014 at 9:31 AM, Alexander Burger a...@software-lab.de wrote:

 Also, you can save quite some time if you pre-allocate memory, to avoid
 an increase with each garbage collection. I would call (gc 800) in the
 beginning, to allocate 800 MB, and (gc 0) in the end.



 Thanks for the reminder about gc; I remember you mentioning it over a
 year ago:
 https://www.mail-archive.com/picolisp@software-lab.de/msg03308.html. I
 added the gc call and completed 30 days of import in two minutes. I also
 switched to my i7 (under Cygwin, too) vs. my Xen virtual host. It ended
 up using 2.7 GB of disk, so I had to stop it. Again, I'm reminded of and
 impressed by the speed.

 Thanks
 Joe



Re: Creation stamp

2014-02-09 Thread Henrik Sarvell
Yes, a bit perhaps.

I tested, and it is of no consequence (at least for my applications):
given one transaction per second for a full year, fetching a random
+Ref +String day takes a fraction of a second on my SSD-equipped PC.
Here is the code:

Note that it's only the collect at the end that takes a fraction of a
second; the insertions do NOT.

(class +Transaction +Entity)
(rel amount (+Number))
(rel createdAt (+Ref +String))

(dbs
   (4 +Transaction)
   (4 (+Transaction createdAt)) )

(pool "/opt/picolisp/projects/test/db/db" *Dbs)

(setq Sday (date 2013 01 01))
(setq Eday (+ Sday 364))
(setq F (db: +Transaction))

(for (D Sday (>= Eday D) (inc D))
   (for (S 1 (>= 86400 S) (inc S))
      (let Stamp (stamp D S)
         (println Stamp)
         (new F '(+Transaction) 'amount 100 'createdAt Stamp) ) )
   (commit)
   (prune) )

(commit)
(prune T)

(println (collect 'createdAt '+Transaction "2013-10-05 00:00:00"
   "2013-10-05 23:59:59"))

(bye)




On Sat, Feb 8, 2014 at 5:44 PM, Alexander Burger a...@software-lab.de wrote:

 Hi Henrik,

 On Fri, Feb 07, 2014 at 08:29:07PM +0700, Henrik Sarvell wrote:
  Given a very large amount of external objects, representing for instance
  transactions, what would be the quickest way of handling the creation
  stamp with regard to future lookups by way of start and end stamps?
 
  It seems to me that using two relations might be optimal, one +Ref +Date
  and an extra +Ref +Time. Then a lookup could first use the +Date relation
  to filter out all transactions that weren't created during the specified
  days, followed by (optionally) a filter by +Time.

 You could use two separate relations, but then I would definitely
 combine them with '+Aux'

(rel d (+Aux +Ref +Date) (t)) # Date
(rel t (+Time))   # Time

 In this way a single B-Tree access is sufficient to find any time range.
 For example, to find all entities between today noon and tomorrow noon:

(collect 'd '+Mup
   (list (date) (time 12 0 0))
   (list (inc (date)) (time 11 59 59)) )


 Another possibility is using not two separate relations, but a single
 bag relation

(rel ts (+Ref +Bag) ((+Date)) ((+Time)))  # Timestamp

 This saves a little space in the objects, but results in the same index
 entry format.


 But anyway, in both cases a single index tree is used. In the first case
 you also have the option to define the time as

(rel t (+Ref +Time))  # Time

 with an additional separate index, so that you can search also for
 certain time ranges only (no matter what the date is).


  Or am I over-thinking it; is a simple +Ref +Number with a UNIX
  timestamp an easier approach that is just as fast?

 I think this would not make any difference in speed (regarding index
 access), but would have some disadvantages, like having to convert this
 format to/from PicoLisp date and time values, and being limited in range
 (the Unix timestamp cannot represent dates before 1970).


  A +Ref +String storing the result of a call to stamp would be ideal, as
  the information is human readable without conversions. However, I
  suspect that a start-end lookup on it would be much slower than the
  above, or?

 Yes, a bit perhaps. Parsing and printing human readable date and time
 values is simple in PicoLisp (e.g. with 'date', 'stamp', 'datStr' and
 related functions, see http://software-lab.de/doc/refD.html#date).

 ♪♫ Alex



Re: Creation stamp

2014-02-08 Thread Alexander Burger
Hi Henrik,

On Fri, Feb 07, 2014 at 08:29:07PM +0700, Henrik Sarvell wrote:
 Given a very large amount of external objects, representing for instance
 transactions, what would be the quickest way of handling the creation
 stamp with regard to future lookups by way of start and end stamps?
 
 It seems to me that using two relations might be optimal, one +Ref +Date
 and an extra +Ref +Time. Then a lookup could first use the +Date relation
 to filter out all transactions that weren't created during the specified
 days followed by (optionally) a filter by +Time.

You could use two separate relations, but then I would definitely
combine them with '+Aux'

   (rel d (+Aux +Ref +Date) (t)) # Date
   (rel t (+Time))   # Time

In this way a single B-Tree access is sufficient to find any time range.
For example, to find all entities between today noon and tomorrow noon:

   (collect 'd '+Mup
  (list (date) (time 12 0 0))
  (list (inc (date)) (time 11 59 59)) )


Another possibility is using not two separate relations, but a single
bag relation

   (rel ts (+Ref +Bag) ((+Date)) ((+Time)))  # Timestamp

This saves a little space in the objects, but results in the same index
entry format.


But anyway, in both cases a single index tree is used. In the first case
you also have the option to define the time as

   (rel t (+Ref +Time))  # Time

with an additional separate index, so that you can also search for
certain time ranges alone (no matter what the date is).
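The two-index idea can be mimicked with sorted lists standing in for B-Trees. In this Python sketch (the data values are made up), one composite (date, time) index answers timestamp ranges, while a second, time-keyed index answers "any day, within this time window" queries:

```python
from bisect import bisect_left, bisect_right

# Nine made-up (date, time) records: 3 days x 3 times of day.
records = [(d, t) for d in (735578, 735579, 735580)
                  for t in (0, 30000, 60000)]
by_stamp = sorted(records)                     # composite (date, time) index
by_time = sorted(records, key=lambda r: r[1])  # time-only index

def stamp_range(lo, hi):
    """All records with lo <= (date, time) <= hi, via one range scan."""
    return by_stamp[bisect_left(by_stamp, lo):bisect_right(by_stamp, hi)]

def time_range(lo, hi):
    """All records whose time of day falls in [lo, hi], any date."""
    keys = [t for _, t in by_time]
    return by_time[bisect_left(keys, lo):bisect_right(keys, hi)]

print(len(stamp_range((735579, 0), (735579, 86399))))  # 3: one full day
print(len(time_range(25000, 65000)))                   # 6: 2 slots x 3 days
```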


 Or am I over-thinking it, is a simple +Ref +Number with a UNIX timestamp an
 easier approach that is just as fast?

I think this would not make any difference in speed (regarding index
access), but would have some disadvantages, like having to convert this
format to/from PicoLisp date and time values, and being limited in range
(the Unix timestamp cannot represent dates before 1970).
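A pair of plain date and time numbers has no 1970 epoch to worry about. A Python sketch using day ordinals as a hypothetical stand-in for PicoLisp's date numbers (the ordinal epoch here, 0001-01-01, is Python's, not PicoLisp's):

```python
from datetime import date

def to_pair(y, m, d, hh=0, mm=0, ss=0):
    """(day ordinal, second of day) -- a stand-in for PicoLisp's
    date/time number pair."""
    return (date(y, m, d).toordinal(), hh * 3600 + mm * 60 + ss)

print(to_pair(1912, 6, 23))             # pre-1970 dates pose no problem
print(to_pair(2013, 10, 5, 14, 59, 8))  # second component: 53948
```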


 A +Ref +String storing the result of a call to stamp would be ideal as the
 information is human readable without conversions. However, I suspect that
 a start-end lookup on it would be much slower than the above, or?

Yes, a bit perhaps. Parsing and printing human readable date and time
values is simple in PicoLisp (e.g. with 'date', 'stamp', 'datStr' and
related functions, see http://software-lab.de/doc/refD.html#date).

♪♫ Alex


Creation stamp

2014-02-07 Thread Henrik Sarvell
Given a very large amount of external objects, representing for instance
transactions, what would be the quickest way of handling the creation stamp
with regard to future lookups by way of start and end stamps?

It seems to me that using two relations might be optimal, one +Ref +Date
and an extra +Ref +Time. Then a lookup could first use the +Date relation
to filter out all transactions that weren't created during the specified
days followed by (optionally) a filter by +Time.

Or am I over-thinking it? Is a simple +Ref +Number with a UNIX timestamp an
easier approach that is just as fast?

A +Ref +String storing the result of a call to stamp would be ideal as the
information is human readable without conversions. However, I suspect that
a start-end lookup on it would be much slower than the above, or?