Re: [sqlite] indexing

2008-02-13 Thread Kasper Daniel Hansen
On Feb 13, 2008, at 11:12 AM, [EMAIL PROTECTED] wrote:

> Kasper Daniel Hansen <[EMAIL PROTECTED]> wrote:
>> I have a table with two variables, say A and B (both integers). The
>> table is rather large - around 2.9 GB on disk. Every combination of
>> (A,B) occurs only once. I am creating a unique index as
>>CREATE UNIQUE INDEX ABidx ON abtable (A,B)
>> It seems that the (A,B) index is created much slower than the (B,A)
>> index. I am wondering about the reason for this. My - very limited -
>> understanding was that sqlite needs to search over the whole database
>> for every (A,B) combination. Could someone give a quick pointer to
>> where the index strategy is described? Would this be quicker if I fit
>> the whole database into memory?
>>
>
> Creating an index on A,B is equivalent to sorting on A,B.
>
> The sorting algorithm currently used by SQLite requires
> O(NlogN) comparisons, which is optimal.  But it also requires
> O(N) disk seeks, which is very suboptimal.  You don't notice
> all these seeks if your database fits in cache.  But when you
> get into databases of about 3GB, the seeking really slows you
> down.
>
> A project on our to-do list is to implement a new sorter
> that uses O(1) seeks.  We know how to do this.  It is just
> finding time to do the implementation.
>
> If creating an index on B,A is much faster than creating an
> index on A,B, that probably means that B,A is initially closer
> to being in sorted order than A,B is.  The initial order of the
> entries does not affect the number of comparisons in a sort,
> but it does reduce the number of seeks if the values are
> initially close to being sorted.

Thank you for a clear and precise answer.

I seem to recall that it is possible to have in-memory databases with  
sqlite. But perhaps that is only possible with a lot of tinkering and  
using the C functions. Is that true? (I am thinking of avoiding the I/O  
bottleneck on our network drive by just moving everything into memory,  
since we have a large-memory server - but only if it is quick and easy.)
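
No C-level tinkering is needed: SQLite treats the special filename ":memory:" as an in-memory database, so it works from the command-line shell ("sqlite3 :memory:") and from any language binding. A minimal sketch in Python:

```python
import sqlite3

# ":memory:" is interpreted by SQLite itself, not by the binding,
# so every API that takes a filename can open an in-memory database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE abtable (A INTEGER, B INTEGER)")
con.executemany("INSERT INTO abtable VALUES (?, ?)", [(1, 2), (3, 4)])
con.execute("CREATE UNIQUE INDEX ABidx ON abtable (A, B)")
n = con.execute("SELECT count(*) FROM abtable").fetchone()[0]
print(n)  # 2
```

The catch: an in-memory database vanishes when the connection closes, so existing data must first be loaded into it (e.g. ATTACH the disk file and INSERT ... SELECT). For a 2.9 GB table that load is itself a sequential read, so it only pays off if it is cheaper than the random seeks it replaces.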

Kasper
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] indexing

2008-02-13 Thread Kasper Daniel Hansen
I have a table with two variables, say A and B (both integers). The  
table is rather large - around 2.9 GB on disk. Every combination of  
(A,B) occurs only once. I am creating a unique index as
   CREATE UNIQUE INDEX ABidx ON abtable (A,B)
It seems that the (A,B) index is created much slower than the (B,A)  
index. I am wondering about the reason for this. My - very limited -  
understanding was that sqlite needs to search over the whole database  
for every (A,B) combination. Could someone give a quick pointer to  
where the index strategy is described? Would this be quicker if I fit  
the whole database into memory?

Kasper


Re: [sqlite] SQLite on Mac

2007-07-20 Thread Kasper Daniel Hansen

On Jul 19, 2007, at 8:45 AM, Ahmed Sulaiman wrote:


Hi all,

Does SQLite work on Mac, and if yes, is there any Mac enabled version
that I could download?


SQLite is part of Mac OS X. Try typing sqlite3 at the command line...

Kasper






Re: [sqlite] indexing large databases

2007-05-11 Thread Kasper Daniel Hansen

On May 10, 2007, at 11:08 PM, Juri Wichanow wrote:

For "CREATE INDEX ..." on a large database: "pragma default_cache_size = 2000;"

For "SELECT ...": "pragma default_cache_size = 1200;"


Hmm, quite interesting.

I would like to share my naive observations, which led me to believe  
the cache_size was important (although I am now beginning to doubt  
this, see below).
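
One detail worth keeping straight when experimenting with these pragmas: cache_size applies only to the current connection, while default_cache_size (the variant in the quoted advice) is stored in the database file itself and becomes the default for future connections. A small sketch, with arbitrary page counts:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# cache_size is per-connection and not persistent. Positive values are
# page counts; negative values (SQLite >= 3.7.10) mean KiB, so e.g.
# "PRAGMA cache_size = -200000" asks for roughly 200 MB of page cache.
con.execute("PRAGMA cache_size = 2000")
size = con.execute("PRAGMA cache_size").fetchone()[0]
print(size)  # 2000
```

So a cache_size set in one sqlite3 shell invocation does nothing for the next invocation, whereas a default_cache_size set once carries over - which matters when the index build runs in a separate session from the pragma.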


First we run the indexing on server 1 - I am essentially the only user,  
but we only have 2 GB of RAM. I see (this is using top, as is everything  
else I say about performance) that very quickly the memory usage caps  
out and the CPU usage drops. My conclusion: an I/O bottleneck. So I  
increase cache_size and, sure enough, the program runs longer before it  
seems to bottleneck.

So I switch to another server (about the same speed, 16 GB RAM, unknown  
I/O performance compared to the first one) and increase the cache_size  
dramatically, hypothesizing that I could just keep the entire db in RAM  
(I now think I understand that I will still have to do I/O when writing  
the index). And indeed I see no drop in CPU usage over the five days the  
program runs before the server gets rebooted. At this point (yesterday)  
I was convinced that the new server was dramatically better.

Nevertheless, I had never terminated my old runs on server 1. And lo and  
behold, yesterday I accidentally discovered they had finished in around  
7-8 days. This makes no sense to me, but it could have something to do  
with how top measures CPU usage. Or I could have made a mistaken  
observation.


Well, I am curious now, so I am going to time this carefully on  
various servers (and report my results back to the list of course).


Kasper




Re: [sqlite] indexing large databases

2007-05-10 Thread Kasper Daniel Hansen


On May 10, 2007, at 3:04 PM, [EMAIL PROTECTED] wrote:


Kasper Daniel Hansen <[EMAIL PROTECTED]> wrote:

Hi

We are using SQLite for a fairly big (but simple) calculation, and
have some problems when creating an index on a database with
600.000.000 rows. Specifically it has not ended even after 5 days of
running. We have done it successfully on 25% of the full database,
and are now wondering what we can do.



This is likely a locality of reference problem.  It comes up
about every two weeks on this mailing list.  You can read more
about it by looking at the mailing list archives.

I think I know how to fix the problem.  But I have not yet had
an opportunity to implement the fix.


Thank you for that pointer. I have now read up a bit on the  
references. From what I understand it has to do with the ordering of  
the data and the disk cache and so on.


I could without too much trouble (although it will take some time)  
sort the data so that it is in the right order for the index (by right  
I mean that if I create a (VAR1, VAR2) index, I want VAR2 sorted in  
increasing order within VAR1). Would that help? Or does the UNIQUE  
keyword make sqlite do some checking that will slow everything down?
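
The presorting could also be done inside SQLite itself, by copying the rows into a fresh table in exactly the order the index wants before building the index. A sketch with hypothetical table names (whether this actually helps depends on how the index builder consumes its input, and the ORDER BY itself costs a sort of the big table):

```python
import random
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw (VAR1 INTEGER, VAR2 INTEGER)")
rows = [(a, b) for a in range(50) for b in range(50)]
random.shuffle(rows)
con.executemany("INSERT INTO raw VALUES (?, ?)", rows)

# Copy the rows in index order, then index the already-sorted table:
# VAR2 increases within each VAR1.
con.execute(
    "CREATE TABLE sortedtab AS SELECT * FROM raw ORDER BY VAR1, VAR2"
)
con.execute("CREATE UNIQUE INDEX idx ON sortedtab (VAR1, VAR2)")
first = con.execute("SELECT VAR1, VAR2 FROM sortedtab LIMIT 1").fetchone()
print(first)  # (0, 0)
```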


Having said that, I can understand that sorting and the disk cache and  
so on factor in - but my initial database is already very large  
(1.3 GB, 145.000.000 rows), and surely the disk cache would already  
factor in at that stage?


Thanks, Kasper






[sqlite] indexing large databases

2007-05-10 Thread Kasper Daniel Hansen

Hi

We are using SQLite for a fairly big (but simple) calculation, and  
have some problems when creating an index on a database with  
600.000.000 rows. Specifically it has not ended even after 5 days of  
running. We have done it successfully on 25% of the full database,
and are now wondering what we can do.


Details:
We have a very simple database, just two columns VAR1 and VAR2 which  
are both integers. Every combination of (VAR1, VAR2) is unique. We  
want to add an index by using the following command

 CREATE UNIQUE INDEX idx ON thetable (VAR1, VAR2);
After creating the index we have to do approx. 100.000.000 lookups on  
(VAR1, VAR2). We have benchmarked this on a small(er) database with  
145.000.000 rows, where the indexing takes around 1 hour and the speed  
of the lookups is _very_ acceptable. However, when running the indexing  
command on the full database it does not terminate even after 5-6 days  
(unfortunately our server got rebooted while it was running). We are  
wondering about the rather drastic increase in run time when going from  
145.000.000 rows to 600.000.000 rows.


?: does the UNIQUE keyword matter, i.e. does it make lookups faster,  
and does it slow down the index creation?
?: does the order of the data matter a lot? Right now it is partially  
sorted, in that all combinations involving a single value of VAR2 (not  
VAR1) are next to each other.
?: is it possible to enable some kind of logging/debugging mode so we  
can monitor progress and see how it is doing?


We run the program on a 2600 MHz Opteron with 16 GB of RAM, and so far  
it seems we are not getting killed by I/O or RAM: top frequently shows  
around 99% CPU usage and less than 40% memory usage. We are using the  
following program (including all pragmas):


sqlite3 blasthits.db "pragma default_cache_size = 1200;"
sqlite3 blasthits.db "pragma synchronous = off;
  pragma temp_store = 2;
  create unique index probeseqIDX on blasthits (probeidx,prokMSA);"
(here blasthits.db is the database name, and probeidx, prokMSA are  
VAR1, VAR2).


?: Is there anything we can do to speed it up further?

Thanks a lot for any help, even if it is "it does not seem to be  
possible to improve the speed of this operation".


Jim and Kasper
