Hi Bo,

> boun...@sqlite.org] On Behalf Of Bo Peng
> > I wonder if it would be better on just having the data organized
> > before loading it, so that the records in each of the 5000 tables
> > would be contiguously stored. Of course, that also depends on how much
> > new data will be added to the tables.
> 
> I have been thinking about this too, but I do not know how to do it
> efficiently.
>
> 3. Create 5000 files, insert records to them, and import the files to the
> main database. This seems to be the best option although I need to pay
> attention to the OS limit on open files.
> 
> Any suggestion is welcome.

A variation of option 3 would be to load the data into an SQLite in-memory
database (see http://www.sqlite.org/inmemorydb.html) in chunks: every time the
in-memory database is full, open the disk files one at a time and append to
each the data from the corresponding table in the memory database. Afterwards,
clear the in-memory database (close the connection or drop the tables) and
start the cycle over.
This way you will not have to worry about the OS limit on the number of open
files, which, as you write, could cause complications.

You could of course code this yourself, holding the same amount of the
original records in a memory structure directly without using SQLite.
However, I imagine it would be faster to implement (and less error prone)
to just use the SQLite in-memory database.

Once all 5000 files have been fully written, they would then be loaded into
the final SQLite database.
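A minimal sketch of the flush cycle, using ATTACH to append an in-memory table to its disk file (the table name, file name, and chunk limit are made-up examples, not from your schema):

```python
import sqlite3

CHUNK_LIMIT = 100_000  # rows to hold in memory before flushing (an assumption)

def flush_to_disk(mem, table, path):
    """Append the in-memory table's rows to its disk file, then clear it."""
    mem.execute("ATTACH DATABASE ? AS disk", (path,))
    mem.execute(f"CREATE TABLE IF NOT EXISTS disk.{table} (id INTEGER, qt REAL)")
    mem.execute(f"INSERT INTO disk.{table} SELECT * FROM main.{table}")
    mem.execute("DETACH DATABASE disk")
    mem.execute(f"DELETE FROM main.{table}")  # clear for the next chunk

# Autocommit mode so ATTACH/DETACH are not blocked by an open transaction.
mem = sqlite3.connect(":memory:", isolation_level=None)
mem.execute("CREATE TABLE t_0001 (id INTEGER, qt REAL)")
mem.executemany("INSERT INTO t_0001 VALUES (?, ?)", [(1, 0.5), (2, 1.5)])
flush_to_disk(mem, "t_0001", "t_0001.db")
```

In the real loop you would call flush_to_disk for each of the 5000 tables whenever the total row count in memory reaches CHUNK_LIMIT; only one disk file is open at any moment.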

> > Is the primary key an integer so that it in fact is the rowid thus
> > saving some space?
> > (see http://sqlite.org/lang_createtable.html, the section: "ROWIDs and
> > the INTEGER PRIMARY KEY".)
> 
> This is certainly worth trying, as long as the performance is acceptable.
> It would be good if I could tell SQLite that there are no duplicate or
> missing values so it does not have to check.

Well, if you don't have any particular use for a primary key, it's probably
best not to define one, so that no processing time is spent on it.
Also, if you do define a column as "INTEGER PRIMARY KEY", it becomes an alias
for the rowid of that record and thus need not be a value that you supply.
The rowid in this case will be frozen for a given record and stay the same
even when the database is vacuumed.
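A quick demonstration of the aliasing (table and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, qt REAL)")
con.execute("INSERT INTO t (id, qt) VALUES (42, 3.14)")
# The id column and the rowid are one and the same value.
row = con.execute("SELECT rowid, id FROM t").fetchone()
# row == (42, 42)
```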

> > If there are several calculations to be done for each table it would
> > perhaps be better to work with several threads concentrating on a
> > single table at a time. This would be particularly effective if your
> > cache_size was set large enough for a table to be contained entirely
> > in memory as the threads could share the same connection and (as I
> > understand it) memory cache (see
> http://www.sqlite.org/sharedcache.html).
> 
> I am running SELECT COUNT(*), MAX(QT) FROM TABLE_XX; etc. as long as the
> WHERE clause is the same. I am also running queries in multiple threads,
> which works well on an SSD drive but not on a mechanical drive due to the
> bottleneck on random-access speed. I am not sure if we are talking about
> the same cache, but because the tables are spread sparsely across the
> database, caching more pages did not really help.

What I mean here is that if you have a number of different COUNT, MAX or
other statistical functions to calculate, for example with several different
WHERE clauses (e.g. 10, 20 or 30), it will pay off to focus on one table at a
time, performing all the queries (whether consecutively or concurrently using
several threads), as the table will be loaded into the memory cache the first
time it is accessed. This assumes that a table contains about a million
records and thus can be held entirely in memory.
This will work regardless of whether you have organized your database so
that records are physically stored contiguously. Of course, it is still best
to have the data stored contiguously, as that will give a huge improvement in
load time.
If there is only a single SELECT COUNT and MAX statement to run against each
of the 5000 tables, it will not make any difference.
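The one-table-at-a-time idea can be sketched like this; the table names, column, cache size and WHERE clauses are assumptions for illustration (the in-memory database here merely stands in for your real file):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA cache_size = -200000")  # ~200 MB page cache; tune to table size

# Demo setup standing in for the real 5000-table database.
for t in ("t_0001", "t_0002"):
    con.execute(f"CREATE TABLE {t} (id INTEGER, qt REAL)")
    con.executemany(f"INSERT INTO {t} VALUES (?, ?)",
                    [(i, i * 0.5) for i in range(10)])

wheres = ["qt > 0", "qt >= 2"]          # the different filters to evaluate
results = {}
for table in ("t_0001", "t_0002"):      # focus on one table at a time...
    for w in wheres:                    # ...and run every query against it
        results[(table, w)] = con.execute(
            f"SELECT COUNT(*), MAX(qt) FROM {table} WHERE {w}"
        ).fetchone()
```

The point is simply the loop order: all queries for a table run while its pages are still cached, instead of cycling through all 5000 tables once per query.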

> > One might even consider a different strategy:
> > Do not use the statistical function of SQLite (count, average, max,
> > etc.) but access the records individually in your code. This requires
> > that data are loaded according to the primary key and that the threads
> > accessing them do so in a turn based fashion, using the modulo
> > function to decide which thread gets to handle the record, e.g.:
> 
> If all tables have the same set of item IDs, this can be a really good
> idea. This is unfortunately not the case because each table has a
> different set of IDs, despite 70% to 90% overlapping keys. I do not even
> know in advance all available IDs.

Okay, so each table will have to be handled by itself. 


Best regards,

Frank

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
