Unfortunately, the matrix is not sparse, nor can the direction of the
operations be changed.

The reason is that I have 2K samples, each given in its own binary file
(2K files in total). Each file has roughly 7M numbers, which I process
with some algorithms I have, reducing the 7M numbers down to 1M
elements.

The best scenario I see here is dumping these data somewhere before
proceeding to the next sample... this processing is done on a
sample-by-sample (column) basis. Once that is finished, the second step
is done on a row-by-row basis (actually in batches of rows, so reads are
minimized and there's still enough RAM to perform the computations I
need).

I have "fixed" my problem, working late night and not sleeping usually
causes that: I missed a WHERE on the UPDATE statement, so I could
modify multiple rows at a time. Basically what I did was create a
primary key (integer), which is the "row number of my matrix" and run
the UPDATE like:

UPDATE table SET column = <values> WHERE id = <ids>;

Shame on me... and I apologize for missing such a basic thing...

But anyway, the first INSERT takes 4 seconds for all 1M rows; the
updates take 9 seconds each.
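Just to make the setup concrete, here is a rough sketch of what the
schema and statements look like now (the values shown are placeholders,
and the BEGIN/COMMIT around the per-row UPDATEs follows the batching
advice quoted below):

CREATE TABLE alleleA (
    id INTEGER PRIMARY KEY,   -- the "row number" of the matrix
    sample1 REAL,
    sample2 REAL
    -- ... all the way to sample2000 REAL
);

-- the first sample is loaded with INSERTs, e.g.
INSERT INTO alleleA (id, sample1) VALUES (1, 0.42);

-- every other sample: one UPDATE per row, grouped in a transaction
BEGIN;
UPDATE alleleA SET sample2 = 0.13 WHERE id = 1;
UPDATE alleleA SET sample2 = 0.57 WHERE id = 2;
-- ... and so on for the remaining rows
COMMIT;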

I'm in the process of creating the whole db now and will give it a
try... maybe creation is the worst part, as I'm going to be accessing
contiguous sets of rows (all columns) at a time, something like:

SELECT * FROM table LIMIT 10000 OFFSET 20000;
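(A variant that might be worth trying, given the integer primary key:
selecting by key range instead, e.g.

SELECT * FROM table WHERE id BETWEEN 20001 AND 30000;

which should let SQLite seek directly to that block of rows rather than
stepping over the first 20000 rows the way OFFSET does.)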

Any recommendation/suggestion is welcome... I very much appreciate it.

Best,

Benilton

On Nov 13, 2007 6:31 PM, Trevor Talbot <[EMAIL PROTECTED]> wrote:
> On 11/13/07, John Stanton <[EMAIL PROTECTED]> wrote:
>
> > The point is that the more statements you have in each transaction the
> > better, up to a limit of perhaps a thousand or so.
>
> Actually, I get the impression that number of statements is not really
> the problem here.
>
>
> On 11/13/07, Benilton Carvalho <[EMAIL PROTECTED]> wrote:
>
> > I'm in a situation where I need to handle 6 matrices with 1M rows and
> > about 2K columns, ie each matrix takes roughly 15GB RAM.
>
> > My (very naive) idea was to use SQLite to store these matrices (via
> > RSQLite package). So I have the following:
> >
> > CREATE TABLE alleleA (sample1 REAL, sample2 REAL <all the way to>
> > sample2000 REAL);
> >
> > When I have the data for sample1, I use an INSERT statement, which
> > takes about 4secs.
>
> For all 1 million rows?
>
> > For all the other columns, I use an UPDATE statement, which is taking
> > hours (more than 8 now).
>
> Using one UPDATE per row, identifying each individual row with SQLite's rowid?
>
>
> So to verify: you're building a 1000000 x 2000 matrix by column, and
> intend to process it by row afterward.  Is there any way you can make
> the two operations use the same direction: either build by row or
> process by column?
>
> In physical layout terms, handling a matrix of that size in two
> directions is lousy in the efficiency department, as there's no room
> for effective caching.  If you consider a storage format that stores a
> row at a time sequentially, then the first pass simply writes out rows
> of 2K zero values, in order, which is fine.  In disk and OS terms, a
> file consists of blocks or pages, and it does I/O and caching in units
> of that size.  We'll say 4KB pages for the sake of argument.  So that
> first pass collects 4KB of data and writes it to the disk in one shot,
> which is about as efficient as you can get.
>
> When you decide to update a column at a time, then it's skipping
> around: modify a tiny value on the first page, skip 16000 bytes ahead
> (2000 x 8-byte REAL values), modify another tiny value, etc.  The I/O
> cost of this is huge, since you're reading and writing an entire page
> just to touch one tiny little value.  Since you never do anything else
> with that page, it gets evicted from cache pretty quickly, to make
> room for the following pages as you work through the file.
>
> That works out to roughly 4GB of raw I/O for each column updated.
> Repeat 2000 times.
>
> If you can't change how you do the operations, you'll have to find
> some storage structure that's more efficient.  There are still some
> things to try, but it may simply mean discarding SQLite (and most
> other relational databases) as inappropriate for the job.
>
>
