RE: [sqlite] FW: 2.8.14 performance, Indexed Table INSERTS

2004-06-21 Thread Edward Bacon
 
>From: D. Richard Hipp [mailto:[EMAIL PROTECTED] 
>
-- snip --
>simplifies to O(logN) which is clearly less than O(N).
>In that case, it pays to use the index.
 
Which is my case, I believe; thanks.  It's been years (OMG, 16!) since I
had an algorithms class.  Is that log base 2, or does it matter?
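(For big-O purposes the base doesn't matter: log_b N = log_2 N / log_2 b,
so changing the base only changes a constant factor, which big-O ignores.
A quick sanity check in Python:)

```python
import math

# Logarithms in different bases differ only by a constant factor,
# so O(log2 N) and O(log10 N) are the same complexity class.
for n in (10, 1_000, 10_602_743):
    ratio = math.log2(n) / math.log10(n)
    # The ratio is always log2(10) ~= 3.3219, independent of n.
    assert abs(ratio - math.log2(10)) < 1e-9
```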
 
Note the index in the example had two fields:
 
CREATE INDEX Articles_Index ON Articles (MessageID, DomainID);
 
The messageid is almost unique, very rarely (every few years?) would it
be duplicated.  The domainid by itself will have many duplicate records,
it is the name of the server that originated the article.  Together they
are unique.
 
By the way, as a brute force experiment I tried an index with only one
field of:
 
CREATE INDEX Articles_Index ON Articles (MessageID);
 
And, keeping N large, got a 30% speed improvement.  I wonder why?  Does
this indicate one could trade some file size (DomainID is a key into
another small table holding the domain string) for a speed improvement
by concatenating the two strings and having only one field?
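(A synthetic micro-benchmark of that idea, using Python's sqlite3 module.
The table and index names mirror the schema in this thread, but the data
is made up, so whether the 30% gap reproduces will depend on key sizes,
page size, and cache; this is a sketch, not a verdict:)

```python
import sqlite3
import time

# Compare insert cost with a two-column index vs. a single concatenated
# key column.  Synthetic data; names follow the thread's schema.
N = 50_000
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE two_col (MessageID text, DomainID integer)")
db.execute("CREATE INDEX two_col_idx ON two_col (MessageID, DomainID)")
db.execute("CREATE TABLE one_col (MsgDomain text)")
db.execute("CREATE INDEX one_col_idx ON one_col (MsgDomain)")

rows = [("msg%09d" % i, i % 11) for i in range(N)]

t0 = time.perf_counter()
db.executemany("INSERT INTO two_col VALUES (?, ?)", rows)
two_col_time = time.perf_counter() - t0

t0 = time.perf_counter()
db.executemany("INSERT INTO one_col VALUES (?)",
               [(m + "|" + str(d),) for m, d in rows])
one_col_time = time.perf_counter() - t0

print("two-column index: %.3fs, concatenated key: %.3fs"
      % (two_col_time, one_col_time))
```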
 
Also back to one of my original questions: can the table be kept in one
file and the index kept in another?  The idea would be to localize
the index, which I suspect is currently spread throughout the database.
 
TIA again.


RE: [sqlite] FW: 2.8.14 performance, Indexed Table INSERTS

2004-06-21 Thread Darren Duncan
At 5:05 PM -0700 6/21/04, Keith Herold wrote:
>> down the result set would make things faster..? Wouldn't the select here:
>>
>>   CREATE TABLE tmp ( flag boolean, name text );
>>   SELECT name FROM tmp WHERE flag = 1 AND name LIKE '%foo%';
>>
>> run faster with an index on the flag column since it can scan just the
>> flag = 1 rows instead of the full table?
>
> I think this is one of those big-O things, right?  It might be faster, but
> only a bit faster, and not enough to justify the hassle of creating and
> maintaining the index.
>
> --Keith
A lot of optimizing is context specific.  Certain kinds of 
optimizations work well on some sets of data, but can make things 
worse with others.  An index is one such optimization.  On some 
data sets, indexes make things a little faster, and on others, 
orders of magnitude faster; the latter in particular is where you 
want to target them.

The main time indexes are useful is when you only want to return a 
tiny fraction of the records in a table, in a typical select.  For 
example, use an index when you want to fetch, say, 1 row out of a 
table with 1000.  But if you want to fetch 200 rows out of the same 
table, then not having an index will be faster for fetches.  And you 
save time by not creating the index too.

When you use an index, the overhead of storing rows is definitely 
greater, and the overhead of fetching is also greater.  However, that 
overhead is only on the rows actually returned for a fetch, so you 
still have a huge savings with an index if you return just a few 
rows.  But if you are selecting a lot of rows, such as the 200, then 
the overhead of finding them first in the index leads to more work 
being done than if the index was ignored and the same 200 were found 
with a simple table scan.
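The 1-out-of-1000 vs. 200-out-of-1000 tradeoff can be sketched with
Python's sqlite3 module (synthetic data; the exact crossover point where
a table scan wins depends on the engine, row sizes, and cache):

```python
import sqlite3

# A table of 1000 rows where one indexed value is highly selective
# (1 matching row) and another is poorly selective (200 matching rows).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (k integer, v text)")
# k = 0 for the first 200 rows (one value, many duplicates);
# k = 200..999 for the rest (each value unique).
db.executemany("INSERT INTO t VALUES (?, ?)",
               [(0 if i < 200 else i, "v%d" % i) for i in range(1000)])
db.execute("CREATE INDEX t_k ON t (k)")

unique_hits = db.execute("SELECT COUNT(*) FROM t WHERE k = 500").fetchone()[0]
dup_hits = db.execute("SELECT COUNT(*) FROM t WHERE k = 0").fetchone()[0]
# 1 row out of 1000: the index pays off handily.  200 rows out of 1000:
# each hit costs an index probe plus a row fetch, and a plain scan may win.
```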

One of the best places to have indexes is usually on columns where 
every value is unique, regardless of how many records you have in the 
table.  But if you have a unique or primary key constraint on the 
column, an index is created implicitly anyway, as it is used to 
enforce the constraint.

-- Darren Duncan
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: [sqlite] FW: 2.8.14 performance, Indexed Table INSERTS

2004-06-21 Thread Keith Herold
> On Jun 20, 2004, at 9:07 PM, Darren Duncan wrote:



> down the result set would make things faster..? Wouldn't the select 
> here:
> 
>CREATE TABLE tmp ( flag boolean, name text );
> 
>SELECT name FROM tmp WHERE flag = 1 AND name LIKE '%foo%';
> 
> run faster with an index on the flag column since it can scan 
> just the 
> flag = 1 rows instead of the full table?

I think this is one of those big-O things, right?  It might be faster, but
only a bit faster, and  not enough to justify the hassle of creating and
maintaining the index.

--Keith





Re: [sqlite] FW: 2.8.14 performance, Indexed Table INSERTS

2004-06-21 Thread Dave Hayden
On Jun 20, 2004, at 9:07 PM, Darren Duncan wrote:
> Generally speaking, you should only use indexes on table columns that
> have a lot of distinct values, and each one only appears a few times.
> You should not use indexes on columns that have few distinct values
> and each appears many times; in the latter case, a full table scan
> would be faster.
That's weird. I would have thought that having any index at all to pare 
down the result set would make things faster..? Wouldn't the select 
here:

  CREATE TABLE tmp ( flag boolean, name text );
  SELECT name FROM tmp WHERE flag = 1 AND name LIKE '%foo%';
run faster with an index on the flag column since it can scan just the 
flag = 1 rows instead of the full table?

-Dave


RE: [sqlite] FW: 2.8.14 performance, Indexed Table INSERTS

2004-06-20 Thread Darren Duncan
Something else I should mention is in regard to good design 
decisions.  This information may actually be more helpful to you.

With any given database engine, there are good times to use indexes 
and bad times, as far as speed and space optimization goes.

Generally speaking, you should only use indexes on table columns that 
have a lot of distinct values, and each one only appears a few times. 
You should not use indexes on columns that have few distinct values 
and each appears many times; in the latter case, a full table scan 
would be faster.

I think a suggested borderline is at 5%.  If each distinct column 
value always/usually occurs in less than 5% of the records, then 
index the column; if each distinct value usually occurs in more than 
5% of the records, then do not index the column.

The actual best borderline may be different, but it's in that ballpark.
And don't forget, this rule is per column.  You can do it in some 
tables and not others.  Also, you should never index a column you 
won't search on.
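That rule of thumb can be checked mechanically.  This is a hypothetical
helper (not part of SQLite) that estimates the average fraction of rows
covered by each distinct value in a column:

```python
import sqlite3

# Hypothetical helper applying the ~5% rule of thumb: if the average
# distinct value covers more than 5% of the rows, skip the index.
def worth_indexing(db, table, column, threshold=0.05):
    total = db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = db.execute(
        f"SELECT COUNT(DISTINCT {column}) FROM {table}").fetchone()[0]
    if total == 0 or distinct == 0:
        return False
    # Average rows per distinct value, as a fraction of the table:
    # (total / distinct) / total == 1 / distinct.
    avg_fraction = 1 / distinct
    return avg_fraction < threshold

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (flag integer, name text)")
db.executemany("INSERT INTO t VALUES (?, ?)",
               [(i % 2, "n%d" % i) for i in range(1000)])
# flag has 2 distinct values -> each covers ~50% of rows -> don't index.
# name is unique -> each value covers 0.1% of rows -> index is worthwhile.
```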

Note that if you have a unique key constraint on a column, then it 
will be implicitly indexed.  My comments only apply to explicit 
indexes.

-- Darren Duncan


RE: [sqlite] FW: 2.8.14 performance, Indexed Table INSERTS

2004-06-20 Thread Edward Bacon
>
>
>
>No indexes: 21 sec.
>Indexing while inserting: 50 sec.
>Indexing after inserting: 37 sec.
>
 
You're right; creating the index after inserting is faster.  Indeed, I
tried that out.  With a larger set of data the gap is even more
dramatic.
 
More benchmarks, taken from the same data dumped from a 3.7GB file being
loaded via sqlite.exe, 10,602,743 rows:
 
Case 1: Just inserts:
 
BEGIN TRANSACTION;
CREATE TABLE ...
INSERT ...
-- etc --
INSERT ...
COMMIT;
 
20 min
 
Case 2: Inserts first, index after:
 
BEGIN TRANSACTION;
CREATE TABLE ...
INSERT ...
-- etc --
INSERT ...
CREATE INDEX ...
COMMIT;
 
2 hrs 30 min
 
Case 3: Index first, inserts after:
 
BEGIN TRANSACTION;
CREATE TABLE ...
CREATE INDEX ...
INSERT ...
-- etc --
INSERT ...
COMMIT;
 
16 hrs 20 min
 
Case 4: Index first, inserts after, not in a transaction:
 
CREATE TABLE ...
CREATE INDEX ...
INSERT ...
-- etc --
INSERT ...
 
26 hrs 58 min
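The winning pattern (case 2 in miniature) can be sketched with Python's
sqlite3 module; the table and index names follow this thread, but the
data here is synthetic:

```python
import sqlite3

# Case 2 from the benchmarks: one big transaction, all inserts first,
# index creation last.  Synthetic rows stand in for the dumped headers.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Articles "
           "(RID integer primary key, MessageID text, DomainID integer)")

rows = [(i, "msg%07d" % i, i % 11) for i in range(1, 10_001)]
with db:  # BEGIN ... COMMIT around the whole load
    db.executemany("INSERT INTO Articles VALUES (?, ?, ?)", rows)
    # Building the index once over data already in place beats maintaining
    # it per-insert (case 3), and far beats per-insert in autocommit
    # mode (case 4), where every statement forces its own journal write.
    db.execute("CREATE INDEX Articles_Index ON Articles (MessageID, DomainID)")
```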
 
As you may have guessed from the table definition this data is from a
USENET newsreader client.  It is storing article header information
coming in from multiple internet servers via multiple TCP connections
run by multiple threads.  The same header will come in multiple times
from a particular server, and also from multiple servers.  Part of the
record being stored is how many servers it is on, so you have to search
the table to find and update it while data is being inserted.  Thus, not
defining the index first is probably not a good idea.
 
The newsreader client is operating somewhere between case 3 and case 4,
which remember are benchmarks using the single threaded sqlite.exe app.
Normally, every few days you would query the news servers about their
current content, delete the rows that have "expired" and add data for
new articles.  Alas, the way it's working now it is so slow it is faster
to delete the database file and reload the whole 3.6GB of data over the
internet at 380 KB/sec cable modem speeds.
 
I'm still interested in what resource is limiting the operation.  My
disk's random access rate is about 5 MB/sec, and data is being stored
at about 50 KB/sec.  Does this mean each INSERT is accessing the index
data about 100 times per record?  (Who can calculate the average number
of nodes touched in a Btree to do this with 10,602,743 records?)  I
assume it is jumping around in such a way that both the disk and DB
caches are being defeated.
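A rough back-of-the-envelope answer, with the fanout as an assumption
(keys like 'dXOuc.4884$pX3.1156' plus a rowid on small pages suggest a
few dozen entries per page in SQLite 2.x):

```python
import math

# Estimate B-tree depth for a 10,602,743-record index under a few
# assumed fanouts (entries per page).  A lookup or insert touches
# roughly one page per level.
n = 10_602_743
for fanout in (20, 40, 100):
    depth = math.ceil(math.log(n, fanout))
    print("fanout %d: ~%d levels touched per lookup" % (fanout, depth))
```

With only about 4-6 levels per probe, the ~100:1 gap between the disk's
random-access rate and the observed load rate points more at seek latency
on cache-missing pages than at the raw node count.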
 
Can the index be configured to be kept more in memory?  Can it be kept
in a different file and would this help (would the index info be more
localized to fill up whole pages)?  The index appears to represent 600
MB of file data.
 
Has there been any evidence that selecting indexed data is faster if the
records are inserted first and indexed later?


Re: [sqlite] FW: 2.8.14 performance, Indexed Table INSERTS

2004-06-20 Thread Tito Ciuro
Hello,

> your best bet is to first declare the database without indexes, then
> load all the data, then add the indexes afterwards.

Indeed.
Just for kicks, I've tested with 100K records.  These are the 
results:

No indexes: 21 sec.
Indexing while inserting: 50 sec.
Indexing after inserting: 37 sec.
-- Tito
On Jun 20, 2004, at 23:38, Darren Duncan wrote:
> At 4:42 PM -0400 6/20/04, Edward Bacon wrote:
>> The operation starts off loading data at about 500K/sec, but drops to
>> about 50K/sec over 30 minutes and stays around there. (If it comes
>> through, see attached)  At that point about 1.2GB of the 3.7GB has
>> loaded.  It takes about 16 hours to complete the full 3.7GB.
>> Without the index, the whole 3.7GB loads in about 20 minutes.
>> Can the index be configured to be kept more in memory?  Can it be kept
>> in a different file and would this help (I suspect if and only if the
>> file is on a different physical disk)?
>
> Since huge loads like this aren't very common in the life of a single
> database, your best bet is to first declare the database without
> indexes, then load all the data, then add the indexes afterwards.  I
> believe the slowdown you are experiencing is from the huge number of
> updates to the index itself, including constant restructuring.  The
> index would be built a lot more cleanly if it were done after all/most
> of the data was in place, I believe. -- Darren Duncan
>
> P.S. I think this same tip will optimize any brand of database.



RE: [sqlite] FW: 2.8.14 performance, Indexed Table INSERTS

2004-06-20 Thread Edward Bacon
I know the insert into an indexed table performance has been discussed
here before, but I wanted to add to it and ask some questions.

I'm benchmarking loading insert statements, dumped from a 3.7GB file,
into an empty database with sqlite.exe:

BEGIN TRANSACTION;
CREATE TABLE Articles(RID integer primary key, RefCnt integer, MessageID
text, DomainID integer, Date integer, Lines integer, Flags integer,
PartSize integer, FileSize integer, SetSize integer, CurPart integer,
MaxPart integer, CurFile integer, MaxFile integer, Subject integer,
Author integer, FileName integer, DlDir integer, PrevPart integer,
NextPart integer, Score integer);
CREATE INDEX Articles_Index ON Articles (MessageID, DomainID);
INSERT INTO Articles VALUES(1,3,'dXOuc.4884$pX3.1156',1,817862976,2521,
1328512,317646,10499202,0,2,33,0,0,1,1,1,0,109099,109100,0);
INSERT INTO Articles VALUES(2,3,'z3Puc.4938$pX3.1969',1,817863259,2524,
1328512,318024,10523016,0,22,33,0,0,1,1,2,0,109676,109701,0);
-- etc --
INSERT INTO Articles VALUES(10602742,1,'8OWdnQHIA84fFSLdRVn-uA',11,
818109133,3050,1327744,384300,384300,0,1,40,23,36,84580,3412,243311,0,0,
10602743,0);
INSERT INTO Articles VALUES(10602743,1,'8OWdnQDIA84aFSLdRVn-uA',11,
818109135,3051,1065600,384426,384426,0,2,40,23,36,84580,3412,243311,0,
10602742,0,0);
COMMIT;

The operation starts off loading data at about 500K/sec, but drops to
about 50K/sec over 30 minutes and stays around there. (If it comes
through, see attached)  At that point about 1.2GB of the 3.7GB has
loaded.  It takes about 16 hours to complete the full 3.7GB.

Without the index, the whole 3.7GB loads in about 20 minutes.

CPU starts at 100% but soon starts to come down.  I expect my hard
drives to be able to deliver about 50MB/sec sequentially, about 5MB/sec
random access, and have a 8-10 msec seek time.  Memory allocation is
stable, sqlite.exe is not paging, and the thread is either "running" or
"waiting for executive".  Using taskinfo2003, the only files with
activity are the input sql file and the output sqlite DB file; the temp
and journal files are quiet, with no movement.  The file position for the
sql input file grows linearly toward its end.  The file position for
the sqlite DB file jumps around.

As the DB grows, you can see the localization deteriorate by watching
the file position.  Taskinfo can give 1/2 second samples.  It starts out
spending most of its time growing the file at the end.  As time goes
on, it starts jumping to positions back in the file.  After 30 min or
so, it is spending most of its time jumping around.  I presume it is
looking for where new records belong in the index ordering.

So, what is the limiting resource?  The seek time?

Can the index be configured to be kept more in memory?  Can it be kept
in a different file and would this help (I suspect if and only if the
file is on a different physical disk)?


TIA

