On 11/13/07, John Stanton <[EMAIL PROTECTED]> wrote:

> The point is that the more statements you have in each transaction the
> better, up to a limit of perhaps a thousand or so.

Actually, I get the impression that the number of statements is not
really the problem here.


On 11/13/07, Benilton Carvalho <[EMAIL PROTECTED]> wrote:

> I'm in a situation where I need to handle 6 matrices with 1M rows and
> about 2K columns, ie each matrix takes roughly 15GB RAM.

> My (very naive) idea was to use SQLite to store these matrices (via
> RSQLite package). So I have the following:
>
> CREATE TABLE alleleA (sample1 REAL, sample2 REAL <all the way to>
> sample2000 REAL);
>
> When I have the data for sample1, I use an INSERT statement, which
> takes about 4secs.

For all 1 million rows?

> For all the other columns, I use an UPDATE statement, which is taking
> hours (more than 8 now).

Using one UPDATE per row, identifying each individual row with SQLite's rowid?
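(What I'm picturing, with the rowid binding being my guess at what
RSQLite ends up issuing, is roughly:

    UPDATE alleleA SET sample2 = ? WHERE rowid = ?;
    -- prepared once, then bound and stepped a million times per column

If that's not the shape of it, correct me.)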


So to verify: you're building a 1000000 x 2000 matrix by column, and
intend to process it by row afterward.  Is there any way you can make
the two operations use the same direction: either build by row or
process by column?
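By "build by row" I mean something like the following sketch, with
three columns standing in for your 2000 and the transaction batching
being my suggestion rather than anything you've described:

    BEGIN;
    INSERT INTO alleleA (sample1, sample2, sample3) VALUES (?, ?, ?);
    -- prepare once, bind and step once per row, COMMIT every few
    -- thousand rows and BEGIN again
    COMMIT;

Batching the commits matters because every transaction costs at least
one fsync.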

In physical layout terms, handling a matrix of that size in two
directions is lousy in the efficiency department, as there's no room
for effective caching.  If you consider a storage format that stores a
row at a time sequentially, then the first pass simply writes out rows
of 2K zero values, in order, which is fine.  In disk and OS terms, a
file consists of blocks or pages, and the OS does I/O and caching in
units of that size.  We'll say 4KB pages for the sake of argument.  So that
first pass collects 4KB of data and writes it to the disk in one shot,
which is about as efficient as you can get.
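(SQLite itself also reads and writes in fixed-size pages; if you want
the real figure instead of my 4KB assumption, ask it:

    PRAGMA page_size;           -- report the current page size
    PRAGMA page_size = 4096;    -- only takes effect on an empty
                                -- database, or right before a VACUUM

The default is 1KB unless you change it before creating any tables.)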

When you update a column at a time, though, you're skipping around:
modify a tiny value on the first page, skip about 16000 bytes ahead
(2000 x 8-byte REAL values), modify another tiny value, and so on.  The I/O
cost of this is huge, since you're reading and writing an entire page
just to touch one tiny little value.  Since you never do anything else
with that page, it gets evicted from cache pretty quickly, to make
room for the following pages as you work through the file.

That works out to roughly 4GB of raw I/O for each column updated: a
million rows, each dragging in (and dirtying) at least one 4KB page.
Repeat 2000 times.

If you can't change how you do the operations, you'll have to find
some storage structure that's more efficient.  There are still some
things to try, but it may simply mean discarding SQLite (and most
other relational databases) as inappropriate for the job.
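Purely as an illustration of the kind of layout I mean (my sketch, not
anything RSQLite hands you), you could store each sample as one row
holding a single packed BLOB:

    CREATE TABLE alleleA_cols (
        sample_id INTEGER PRIMARY KEY,   -- 1 .. 2000
        data      BLOB                   -- the 1M doubles for that
                                         -- sample, written in one shot
    );

That makes the column-at-a-time writes sequential, but it only moves
the pain: pulling out one logical row afterwards means touching all
2000 BLOBs.  You can't make both directions cheap at once, which is
really the point above.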
