On 6/10/2014 2:25 PM, Simon Slavin wrote:

> On 10 Jun 2014, at 7:22pm, Fabian Giesen <fabi...@radgametools.com> wrote:
>
>>   1. Bulk insertion of relatively large amounts of data (low hundreds
>> of megabytes to tens of gigabytes); a typical profile generates on the
>> order of 10MB of sampling data per second.
>
> I presume you do your INSERTs inside a transaction.  If not, you should.

Yes, it's all wrapped in transactions; typically on the order of 10-100
per second. We're talking (order of magnitude) about 100k rows inserted
per second of capture here; it would be nowhere near fast enough if
every INSERT were its own transaction.
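
For reference, the pattern is roughly this (table and column names are
invented for this mail, not our actual schema):

  BEGIN;
  -- one prepared INSERT, re-bound and re-executed per sample:
  INSERT INTO samples (time, thread_id, sym_id) VALUES (?, ?, ?);
  -- ...a few thousand more rows...
  COMMIT;

i.e. one transaction per batch of samples, not one per row.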

>> Each of these statements ends up scanning over the entire table once.
>> Since the tables in question are quite large, I would love to have
>> some way to create multiple indexes in one pass.
>
> Not possible in current SQLite.  No standard syntax for doing it in SQL.  A
> quick google suggests that no other popular DBMS implements it.

I beg to differ.

MySQL's InnoDB has it, for example:

  http://dev.mysql.com/doc/refman/5.5/en/innodb-create-index-examples.html
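
The examples on that page boil down to something like this (schematic,
not our actual schema):

  ALTER TABLE samples
    ADD INDEX idx_time (time),
    ADD INDEX idx_thread (thread_id);

i.e. a single ALTER TABLE naming several indexes at once, which at
least gives the engine the opportunity to build them in one pass over
the table data.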

> And ... I'm not sure it would be that much faster.  The majority of the work
> it's doing is writing.  Reading the existing table data is fast and easy, but
> writing new data and figuring out the pointers takes the majority of the time.
> Because of this I don't think it likely that this feature will be added to
> standard SQLite.  Of course, if you really care about this you could add it
> yourself.

The majority of the work is sorting, as Richard correctly points out,
not writing. The indexes in question are all on one or two integer keys
so they don't end up particularly big.

The average sample record comes out at ~100 bytes. All the columns are
integer values; anything more complicated associated with a sample is
stored as unique records in separate tables and referenced via foreign
keys.
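
For concreteness, the shape of the schema is roughly as follows (names
again invented for this mail):

  CREATE TABLE samples (
    time      INTEGER,  -- 64-bit tick count, see below
    thread_id INTEGER REFERENCES threads(id),
    sym_id    INTEGER REFERENCES symbols(id)
  );
  -- strings and other variable-size data live in the referenced
  -- tables; the sample rows themselves are all-integer.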

> To optimize this, when you create the table put the columns you're going to
> index near the beginning of the rows.  SQLite stops reading the data of a table
> row once it has all the columns it needs.  I don't know whether this will make
> a lot of difference but it won't do any harm.
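
In terms of the sketch above, that would just mean declaring the
indexed columns first, which costs nothing to do:

  -- hypothetical: indexed key columns up front, so SQLite can stop
  -- decoding a record once it has what it needs
  CREATE TABLE samples (
    time      INTEGER,  -- indexed
    thread_id INTEGER,  -- indexed
    sym_id    INTEGER   -- never indexed, goes last
  );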

When the dataset is small enough to fit in memory (this is on
workstation PCs, so this means "less than a few gigabytes") we're CPU
bound (doing sorting, mostly, as Richard correctly suggested).

However, looking at the sorting time vs. number of records, sorting (in
that scenario, for our indices) typically proceeds at well over 1M
records/s (depends on the machine, of course; on my work PC it's
closer to 1.4M records/s).

At that rate, in an out-of-core situation, we would need to be reading
~140MB/s of records (1.4M records/s times ~100 bytes/record), sustained,
during index build time, plus the write bandwidth for the resulting
index (a small fraction; the indexes on the large tables are all on one
or two integer keys), to not become IO bound. That's possible with a
good hard drive (and most SSDs), but it's closer than I'm fully
comfortable with.

The reason I'm asking is that we're seeing customers run into
disproportionately slow indexing times at the end of long (~1h!)
capture runs; since the dataset size is the only thing that's
fundamentally different about longer runs, I started looking into it.
I'm well aware this is an extreme case in various ways, though; I just
wanted to make sure there wasn't an approach I was missing!

> Also, store your time as a numeric value (Unix epoch or Julian day) rather
> than a string.  Numbers are faster to handle than strings.  INTEGERs are
> faster to handle than REALs.

:-)

The time fields are 64-bit ints. The exact meaning depends on the
capture target platform (we support several), but usually it's either
a system-synchronized CPU clock cycle counter (when available) or
ticks of a high-frequency bus/event timer.
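
Conversion to wall-clock units only happens at presentation time,
along these lines (the 2.4GHz tick rate here is made up):

  -- 'time' stores raw 64-bit ticks; divide by the tick rate only
  -- when displaying:
  SELECT time / 2400000000.0 AS seconds_into_capture
  FROM samples
  ORDER BY time
  LIMIT 10;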

-Fabian
