Re: [sqlite] indexing large databases

2007-05-11 Thread Kasper Daniel Hansen

On May 10, 2007, at 11:08 PM, Juri Wichanow wrote:

> For "create index ..." in a large database: "pragma default_cache_size = 2000;"
>
> For "select ...": "pragma default_cache_size = 1200;"


Hmm, quite interesting.

I would like to share my naive observations, which led me to believe  
the cache_size was important (although I am now beginning to doubt  
this, see below).


First we run the indexing on server 1 - I am essentially the only user, but it has only 2 GB of RAM. I see (using top, as with everything else I say about performance) that very quickly the memory usage caps out and the CPU usage drops. My conclusion: an I/O bottleneck. So I increase cache_size, and sure enough the program runs longer before it seems to bottleneck. So I switch to another server (about the same speed, 16 GB of RAM, unknown I/O performance compared to the first one) and increase the cache_size dramatically, hypothesizing that I could just hold the entire db in RAM (I now think I understand that I will still have to do I/O when writing the index). And indeed I see no drop in CPU usage over the five days the program runs before the server gets rebooted. At this point (yesterday) I was convinced that the new server was dramatically better.

Nevertheless, I had never terminated my old runs on server 1. And lo and behold, yesterday I accidentally discovered they had finished in around 7-8 days. This makes no sense to me, but it could have something to do with how top measures CPU usage, or I could simply have made a mistaken observation.


Well, I am curious now, so I am going to time this carefully on various servers (and report my results back to the list, of course).
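The timing itself can be as simple as wrapping the index build in the shell's time builtin; a sketch, reusing the database and index definition from the original post further down:

time sqlite3 blasthits.db "create unique index probeseqIDX on blasthits (probeidx, prokMSA);"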


Kasper




[sqlite] indexing large databases

2007-05-11 Thread Juri Wichanow
For "create index ..." in a large database: "pragma default_cache_size = 2000;"

For "select ...": "pragma default_cache_size = 1200;"

Juri
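Since default_cache_size is stored in the database file itself, the setting persists across connections and can be switched around the build; a sketch (the numbers are Juri's suggestions above - the unit is pages, not bytes - and mydb.db, mytable and myindex are placeholder names):

sqlite3 mydb.db "pragma default_cache_size = 2000;"   # before the index build
sqlite3 mydb.db "create unique index myindex on mytable (VAR1, VAR2);"
sqlite3 mydb.db "pragma default_cache_size = 1200;"   # back down for SELECT-heavy use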


Re: [sqlite] indexing large databases

2007-05-10 Thread drh
Kasper Daniel Hansen <[EMAIL PROTECTED]> wrote:
> 
> Having said that, I can understand that sorting and the disk cache and
> so on factor in - but my initial database is already very large (1.3 GB,
> 145.000.000 rows), and surely the disk cache would already factor in
> at that stage?
> 

As long as your table does not contain out-of-order indices,
each new row will be inserted at the end.  So the last few
disk pages will be in cache and everything will go quickly.
It is when you have to start inserting at random places in
the middle of the database - on disk pages that are not
cached - that things slow down.

--
D. Richard Hipp <[EMAIL PROTECTED]>
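One way to exploit this is to feed rows in index order, so the b-tree grows mostly by appends to its rightmost pages instead of insertions into random, uncached interior pages; a sketch with hypothetical file, table and column names:

# numeric sort on var1, then on var2 within var1
sort -t',' -k1,1n -k2,2n data.csv > data.sorted.csv
sqlite3 test.db <<'EOF'
create table t (var1 integer, var2 integer);
.separator ,
.import data.sorted.csv t
create unique index t_idx on t (var1, var2);
EOF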





Re: [sqlite] indexing large databases

2007-05-10 Thread Kasper Daniel Hansen


On May 10, 2007, at 3:04 PM, [EMAIL PROTECTED] wrote:


> Kasper Daniel Hansen <[EMAIL PROTECTED]> wrote:
>> Hi
>>
>> We are using SQLite for a fairly big (but simple) calculation, and
>> have some problems when creating an index on a database with
>> 600.000.000 rows. Specifically, it has not ended even after 5 days of
>> running. We have done it successfully on 25% of the full database,
>> and are now wondering what we can do.
>
> This is likely a locality of reference problem.  It comes up
> about every two weeks on this mailing list.  You can read more
> about it by looking at the mailing list archives.
>
> I think I know how to fix the problem.  But I have not yet had
> an opportunity to implement the fix.


Thank you for that pointer. I have now read up a bit on the references. From what I understand, it has to do with the ordering of the data, the disk cache, and so on.


I could, without too much trouble (although it will take some time), sort the data so that it is ordered the right way for the index (by "right" I mean that for a (VAR1, VAR2) index I want VAR2 sorted in increasing order within VAR1). Would that help? Or does the UNIQUE keyword make SQLite do some checking that slows everything down? (A quick way to test the latter is sketched below.)
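Whether UNIQUE adds measurable cost could be checked empirically on the smaller database; a sketch (idx_u and idx_p are made-up index names, timed via the shell's time builtin):

time sqlite3 blasthits.db "create unique index idx_u on blasthits (probeidx, prokMSA);"
sqlite3 blasthits.db "drop index idx_u;"
time sqlite3 blasthits.db "create index idx_p on blasthits (probeidx, prokMSA);"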


Having said that, I can understand that sorting and the disk cache and so on factor in - but my initial database is already very large (1.3 GB, 145.000.000 rows), and surely the disk cache would already factor in at that stage?


Thanks, Kasper






Re: [sqlite] indexing large databases

2007-05-10 Thread drh
Kasper Daniel Hansen <[EMAIL PROTECTED]> wrote:
> Hi
> 
> We are using SQLite for a fairly big (but simple) calculation, and
> have some problems when creating an index on a database with
> 600.000.000 rows. Specifically, it has not ended even after 5 days of
> running. We have done it successfully on 25% of the full database,
> and are now wondering what we can do.
> 

This is likely a locality of reference problem.  It comes up
about every two weeks on this mailing list.  You can read more
about it by looking at the mailing list archives.

I think I know how to fix the problem.  But I have not yet had
an opportunity to implement the fix.

--
D. Richard Hipp <[EMAIL PROTECTED]>





[sqlite] indexing large databases

2007-05-10 Thread Kasper Daniel Hansen

Hi

We are using SQLite for a fairly big (but simple) calculation, and have some problems when creating an index on a database with 600.000.000 rows. Specifically, it has not ended even after 5 days of running. We have done it successfully on 25% of the full database, and are now wondering what we can do.


Details:
We have a very simple database, just two columns VAR1 and VAR2 which  
are both integers. Every combination of (VAR1, VAR2) is unique. We  
want to add an index by using the following command

 CREATE UNIQUE INDEX myindex ON mytable (VAR1, VAR2);

(Here myindex and mytable are placeholders; the real names appear in the script at the end of this message.) After creating the index we have to do approximately 100.000.000 lookups on (VAR1, VAR2). We have benchmarked the results on a small(er) database with 145.000.000 rows, where the indexing takes around 1 hour and the speed of the lookups is _very_ acceptable. However, when running the indexing command on the full database it does not terminate even after 5-6 days (unfortunately our server got rebooted while it was running). We are wondering about the rather drastic increase in run time when going from 145.000.000 rows to 600.000.000 rows.
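Each of those lookups is a point query that the (VAR1, VAR2) index can serve directly; a sketch against the real schema given below (the values 17 and 42 are made up):

sqlite3 blasthits.db "select * from blasthits where probeidx = 17 and prokMSA = 42;"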


?: Does the UNIQUE keyword matter, i.e. does it make lookups faster, and does it slow down index creation?
?: Does the order of the data matter a lot? Right now it is partially sorted, in that all combinations involving a single value of VAR2 (not VAR1) are next to each other.
?: Is it possible to enable some kind of logging/debugging mode so we can monitor progress and see how it is doing? (A crude external workaround is sketched after these questions.)
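Absent any built-in progress reporting, one crude external monitor is to watch the database and journal files grow while the build runs; a sketch:

while true; do ls -l blasthits.db* 2>/dev/null; sleep 60; done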


We run the program on a 2600 MHz Opteron with 16 GB of RAM, and so far it seems we are not getting killed by I/O or RAM: top frequently shows around 99% CPU usage and less than 40% memory usage. We are using the following program (including all pragmas):


sqlite3 blasthits.db "pragma default_cache_size = 1200;"
sqlite3 blasthits.db "pragma synchronous = off;
  pragma temp_store = 2;
  create unique index probeseqIDX on blasthits (probeidx, prokMSA);"

(Here blasthits.db is the database name, and probeidx and prokMSA are VAR1 and VAR2.)
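As a sanity check, running the pragma with no argument prints the value currently recorded in the database file, confirming the setting took effect:

sqlite3 blasthits.db "pragma default_cache_size;"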


?: Is there anything we can do to speed it up further?

Thanks a lot for any help, even if it is "it does not seem to be  
possible to improve the speed of this operation".


Jim and Kasper
