On Sunday 29 October 2006 18:47, Joe Wilson wrote:
> The last test that simulated pre-sorting 2 million rows
> in index order prior to insert may show optimal insert speed
> since it's only appending pages to the table and the index,
> but it may not be realistic, since you may not have sufficient
> memory to pre-sort all data in memory prior to insert.
> (Although if you have the memory to spare, this is the way to go).

Presorting is unfortunately not always possible, e.g. when 
there are several indexes (with different order). But increasing 
the cache seems a good solution for this anyway (and is less 
cumbersome than presorting or staging). Your ideas set me thinking
on why even with the larger cache, we still do worse than mysql on large 
datasets, and did some further experimenting (code at the end of the mail, 
all tests with larger cache size). 

I guess that it is not only the number of entries that is 
important, but that there are at least two different factors 
affecting the performance: the number of different values actually present 
in the index, and the number of rows that have one value

The number of values has an important influence. If there is 
only a limited number of possible values, performance stays nearly 
identical regardless of the data size (and thus faster than mysql).

For larger numbers of possible values behaviour gets a bit 
unexpected (to me at least): the time needed for random inserts rises 
about linear with the data size, until it reaches a certain point, 
after which the insert time remains the same regardless of dataset 
size. The height of this plateau seems to related to the number of 
possible values in a linear fashion.

Any ideas?

---------- code ----------

proc runtest {sqlfile} {
  set result {}
  set delay 1000
  exec sync; after $delay;
  set t [time "exec sqlite3 lc.db <$sqlfile" 1]
  set t [expr {[lindex $t 0]/1000000.0}]
  lappend result [format %.3f $t]
  return $result
}

set numvlist {2 10 100 200 300 400 500 600 700 800 1000 2000 3000 5000 7000 
10000}
# Initialize the environment
#
expr srand(1)
catch {exec /bin/sh -c {rm -f *.db}}
set fd [open clear.sql w]
foreach numv $numvlist {
        puts $fd "drop table t$numv;"
}
close $fd
catch {exec mysql drh <clear.sql}
set fd [open 2kinit.sql w]
puts $fd {
  PRAGMA default_cache_size=2000;
}
close $fd
exec sqlite3 def.db <2kinit.sql
set fd [open nosync-init.sql w]
puts $fd {
  PRAGMA default_cache_size=20000;
}
close $fd
exec sqlite3 lc.db <nosync-init.sql

# initialize
set fd [open test.sql w]
foreach numv $numvlist {
        puts $fd "BEGIN;"
        puts $fd "CREATE TABLE t${numv}(a INTEGER, b INTEGER);"
        puts $fd "CREATE INDEX tq_b${numv} on t${numv}(b);"
        puts $fd "COMMIT;"
}
close $fd
runtest test.sql

set o [open results.csv w]
set temp "tablesize"
foreach numv $numvlist {
#       append temp \tsqlite$numv\tsqlite-lc$numv\t mysql$numv
        append temp \t$numv
}
puts $o $temp
set step 20000
foreach numv $numvlist {
        set num 0
        puts "tablesize: $num    numv: $numv"
        set fd [open test.sql w]
        puts $fd "BEGIN;"
        set r 0
        for {set i 1} {$i<=$step} {incr i} {
                puts $fd "INSERT INTO t$numv VALUES([expr {$num+$i}],$r);"
                incr r
                if {$r == $numv} {set r 0}
        }
        puts $fd "COMMIT;"
        close $fd
        eval lappend resultline [runtest test.sql]
}
puts $o $num\t[join $resultline \t]
flush $o
puts start
for {} {$num < 2000000} {incr num $step} {
        set resultline {}
        foreach numv $numvlist {
                puts "tablesize: $num    numv: $numv"
                set fd [open test.sql w]
                puts $fd "BEGIN;"
                for {set i 1} {$i<=$step} {incr i} {
                        set r [expr {int(rand()*$numv)}]
                        puts $fd "INSERT INTO t$numv VALUES([expr 
{$num+$i}],$r);"
                }
                puts $fd "COMMIT;"
                close $fd
                eval lappend resultline [runtest test.sql]
        }
        puts $o $num\t[join $resultline \t]
        flush $o
}
close $o



-- 
Dr Peter De Rijk                         E-mail: [EMAIL PROTECTED]
Bioinformatics Unit                          Tel. +32 3 265 10 08
Department of Molecular Genetics VIB8        Fax. +32 3 265 10 12
University of Antwerp                http://www.molgen.ua.ac.be/
Universiteitsplein 1                 
B-2610 Antwerpen, Belgium

The glass is not half full, nor half empty. The glass is just too big.

-----------------------------------------------------------------------------
To unsubscribe, send email to [EMAIL PROTECTED]
-----------------------------------------------------------------------------

Reply via email to