RE: [sqlite] FW: 2.8.14 performance, Indexed Table INSERTS
>From: D. Richard Hipp [mailto:[EMAIL PROTECTED]]
> -- snip --
>simplifies to O(logN) which is clearly less than O(N).
>In that case, it pays to use the index.

Which is my case, I believe. Thanks. It's been years (OMG, 16!) since I had an algorithms class. Is that log base 2, or does it matter?

Note the index in the example had two fields:

CREATE INDEX Articles_Index ON Articles (MessageID, DomainID);

The MessageID is almost unique; only very rarely (every few years?) would it be duplicated. The DomainID by itself will have many duplicate records; it is the name of the server that originated the article. Together they are unique.

By the way, as a brute-force experiment I tried an index with only one field:

CREATE INDEX Articles_Index ON Articles (MessageID);

and, keeping N large, got a 30% speed improvement. I wonder why? Does this indicate one could trade some file size (DomainID is a key into another small table holding the domain string) for a speed improvement by concatenating the two strings and having only one field?

Also, back to one of my original questions: can the table be kept in one file and the index in another? The idea would be to localize the index; I suspect it is currently spread throughout the database.

TIA again.
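On the log-base question: for big-O purposes the base doesn't matter, because logarithms in different bases differ only by a constant factor. A quick illustrative sketch in Python, using the 10,602,743-row count from the benchmarks elsewhere in this thread:

```python
import math

N = 10_602_743  # row count from the benchmarks in this thread

# Change of base: log2(N) = ln(N) / ln(2), so any two log bases
# differ only by a constant factor and give the same O(log N) class.
print(math.log2(N))               # ~23.3 comparisons for a binary search
print(math.log(N) / math.log(2))  # same value computed via natural log
print(math.log10(N))              # ~7.0 -- smaller by a constant factor
```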
RE: [sqlite] FW: 2.8.14 performance, Indexed Table INSERTS
At 5:05 PM -0700 6/21/04, Keith Herold wrote:
>> down the result set would make things faster..? Wouldn't the select here:
>>
>> CREATE TABLE tmp ( flag boolean, name text );
>>
>> SELECT name FROM tmp WHERE flag = 1 AND name LIKE '%foo%';
>>
>> run faster with an index on the flag column since it can scan just the
>> flag = 1 rows instead of the full table?
>
>I think this is one of those big-O things, right? It might be faster, but only a bit faster, and not enough to justify the hassle of creating and maintaining the index.
>
>--Keith

A lot of optimizing is context specific. Certain kinds of optimizations work well on some sets of data but can make things worse with others; an index is one such optimization. On some data sets, indexes make things a little faster, and on others, orders of magnitude faster; the latter in particular is where you want to target them.

The main time indexes are useful is when a typical select returns only a tiny fraction of the records in a table. For example, use an index when you want to fetch, say, 1 row out of a table with 1000. But if you want to fetch 200 rows out of the same table, then not having an index will be faster for fetches, and you save the time of creating the index too.

When you use an index, the overhead of storing rows is definitely greater, and the overhead of fetching is also greater. However, that overhead applies only to the rows actually returned by a fetch, so you still have a huge savings with an index if you return just a few rows. But if you are selecting a lot of rows, such as the 200 above, then the overhead of finding them first in the index leads to more work than if the index were ignored and the same 200 rows were found with a simple table scan.

One of the best places to have indexes is usually on columns where every value is unique, regardless of how many records you have in the table.
But then, if you have a unique or primary key constraint on the column, an index is created implicitly anyway, as it is used when enforcing the constraint.

-- Darren Duncan

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
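The flag example from the quoted message can be inspected mechanically. A sketch using the modern sqlite3 Python module (SQLite 3, not the 2.8 under discussion); the tmp_flag index name and the 1000-row fill are mine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp (flag INTEGER, name TEXT)")
conn.execute("CREATE INDEX tmp_flag ON tmp (flag)")
# Half the rows have flag = 1, so the index can narrow the scan to
# at best 50% of the table -- the LIKE filter still has to examine
# every row the index hands back.
conn.executemany("INSERT INTO tmp VALUES (?, ?)",
                 [(i % 2, f"name{i}") for i in range(1000)])

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT name FROM tmp WHERE flag = 1 AND name LIKE '%foo%'"
).fetchall()
for row in plan:
    print(row[-1])  # shows whether the planner chose a SCAN or an index SEARCH
```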
RE: [sqlite] FW: 2.8.14 performance, Indexed Table INSERTS
> On Jun 20, 2004, at 9:07 PM, Darren Duncan wrote:
>> down the result set would make things faster..? Wouldn't the select here:
>>
>> CREATE TABLE tmp ( flag boolean, name text );
>>
>> SELECT name FROM tmp WHERE flag = 1 AND name LIKE '%foo%';
>>
>> run faster with an index on the flag column since it can scan just the
>> flag = 1 rows instead of the full table?

I think this is one of those big-O things, right? It might be faster, but only a bit faster, and not enough to justify the hassle of creating and maintaining the index.

--Keith
Re: [sqlite] FW: 2.8.14 performance, Indexed Table INSERTS
On Jun 20, 2004, at 9:07 PM, Darren Duncan wrote:
> Generally speaking, you should only use indexes on table columns that
> have a lot of distinct values, and each one only appears a few times.
> You should not use indexes on columns that have few distinct values
> and each appears many times; in the latter case, a full table scan
> would be faster.

That's weird. I would have thought that having any index at all to pare down the result set would make things faster..? Wouldn't the select here:

CREATE TABLE tmp ( flag boolean, name text );

SELECT name FROM tmp WHERE flag = 1 AND name LIKE '%foo%';

run faster with an index on the flag column since it can scan just the flag = 1 rows instead of the full table?

-Dave
RE: [sqlite] FW: 2.8.14 performance, Indexed Table INSERTS
Something else I should mention is in regard to good design decisions; this information may actually be more helpful to you.

With any given database engine, there are good times to use indexes and bad times, as far as speed and space optimization goes. Generally speaking, you should only use indexes on table columns that have a lot of distinct values, where each value appears only a few times. You should not use indexes on columns that have few distinct values where each value appears many times; in the latter case, a full table scan would be faster.

I think a suggested borderline is at 5%. If each distinct column value always/usually occurs in less than 5% of the records, then index the column; if each distinct value usually occurs in more than 5% of the records, then do not index the column. The actual best borderline may be different, but it's in that ballpark.

And don't forget, this rule is per column; you can apply it in some tables and not others. Also, you should never index a column you won't search on.

Note that if you have a unique key constraint on a column, it will be implicitly indexed. My comments only apply to explicit indexes.

-- Darren Duncan
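The 5% rule of thumb above is easy to check per column. A hypothetical sketch with the modern sqlite3 Python module, using a toy version of the Articles table from this thread (the 10,000-row fill and the 10-domain distribution are mine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Articles (MessageID TEXT, DomainID INTEGER)")
# Toy data: every MessageID unique, only 10 distinct DomainID values.
conn.executemany("INSERT INTO Articles VALUES (?, ?)",
                 [(f"msg-{i}", i % 10) for i in range(10_000)])

# Average share of the table occupied by each distinct value.
# Below ~5% per value, indexing the column is likely to pay off.
for col in ("MessageID", "DomainID"):
    total, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT {col}) FROM Articles").fetchone()
    share = 100.0 / distinct  # average % of rows per distinct value
    verdict = "index" if share < 5 else "don't index"
    print(f"{col}: {distinct} distinct, ~{share:.2f}% each -> {verdict}")
```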
RE: [sqlite] FW: 2.8.14 performance, Indexed Table INSERTS
>No indexes: 21 sec.
>Indexing while inserting: 50 sec.
>Indexing after inserting: 37 sec.

You're right; creating the index after inserting is faster. Indeed, I tried that out. With a larger set of data the gap is even more dramatic. More benchmarks, taken from the same data dumped from a 3.7GB file being loaded via sqlite.exe, 10,602,743 rows:

Case 1: Just inserts:
BEGIN TRANSACTION;
CREATE TABLE ...
INSERT ...
-- etc --
INSERT ...
COMMIT;
20 min

Case 2: Inserts first, index after:
BEGIN TRANSACTION;
CREATE TABLE ...
INSERT ...
-- etc --
INSERT ...
CREATE INDEX ...
COMMIT;
2 hrs 30 min

Case 3: Index first, inserts after:
BEGIN TRANSACTION;
CREATE TABLE ...
CREATE INDEX ...
INSERT ...
-- etc --
INSERT ...
COMMIT;
16 hrs 20 min

Case 4: Index first, inserts after, not in a transaction:
CREATE TABLE ...
CREATE INDEX ...
INSERT ...
-- etc --
INSERT ...
26 hrs 58 min

As you may have guessed from the table definition, this data is from a USENET newsreader client. It is storing article header information coming in from multiple internet servers via multiple TCP connections run by multiple threads. The same header will come in multiple times from a particular server, and also from multiple servers. Part of the record being stored is how many servers it is on, so you have to search the table to find and update it while data is being inserted. Thus, not defining the index first is probably not a good idea.

The newsreader client is operating somewhere between case 3 and case 4, which, remember, are benchmarks using the single-threaded sqlite.exe app. Normally, every few days you would query the news servers about their current content, delete the rows that have "expired", and add data for new articles. Alas, the way it's working now is so slow that it is faster to delete the database file and reload the whole 3.6GB of data over the internet at 380 KB/sec cable modem speeds.

I'm still interested in what resource is limiting the operation.
My random access rate for my disk is about 5 MB/sec. Data is being stored at about 50 KB/sec. Does this mean each INSERT is accessing the index data about 100 times per record? (Who can calculate the average number of nodes touched in a B-tree to do this with 10,602,743 records?) I assume it is jumping around in such a way that both the disk and DB caches are being defeated.

Can the index be configured to be kept more in memory? Can it be kept in a different file, and would this help (would the index info be more localized, filling up whole pages)? The index appears to represent 600 MB of file data.

Has there been any evidence that selecting indexed data is faster if the records are inserted first and indexed later?
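The back-of-the-envelope arithmetic above can be sketched. This is illustrative, not a measurement: the fanout values below are guesses, since the real keys-per-page count depends on the page size and the length of the MessageID strings:

```python
import math

N = 10_602_743  # rows in the Articles table

# A B-tree lookup touches roughly log_f(N) pages, where f is the
# fanout (keys per interior page). The fanouts here are illustrative
# guesses, not values measured from the actual database file.
for fanout in (10, 50, 100):
    depth = math.log(N, fanout)
    print(f"fanout {fanout:>3}: ~{depth:.1f} pages per lookup")

# Observed-throughput arithmetic from the post: 5 MB/sec of random
# I/O capacity divided by a 50 KB/sec net insert rate suggests about
# 100 page accesses' worth of I/O per record.
print(5_000 // 50)  # → 100
```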
Re: [sqlite] FW: 2.8.14 performance, Indexed Table INSERTS
Hello,

>your best bet is to first declare the database without indexes, then load all the data, then add the indexes afterwards.

Indeed. Just for kicks, I've tested with 100K records. These are the results:

No indexes: 21 sec.
Indexing while inserting: 50 sec.
Indexing after inserting: 37 sec.

-- Tito

On Jun 20, 2004, at 23:38, Darren Duncan wrote:
>At 4:42 PM -0400 6/20/04, Edward Bacon wrote:
>>The operation starts off loading data at about 500K/sec, but drops to about 50K/sec over 30 minutes and stays around there. (If it comes through, see attached.) At that point about 1.2GB of the 3.7GB has loaded. It takes about 16 hours to complete the full 3.7GB. Without the index, the whole 3.7GB loads in about 20 minutes.
>>
>>Can the index be configured to be kept more in memory? Can it be kept in a different file and would this help (I suspect if and only if the file is on a different physical disk)?
>
>Since huge loads like this aren't very common in the life of a single database, your best bet is to first declare the database without indexes, then load all the data, then add the indexes afterwards.
>
>I believe the slowdowns you are experiencing are from the huge number of updates to the index itself, including constant restructuring. The index would be built a lot more cleanly if it were done after all/most of the data was in place, I believe.
>
>-- Darren Duncan
>
>P.S. I think this same tip will optimize any brand of database.
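The load-then-index tip can be reproduced as a self-contained experiment with the modern sqlite3 Python module (SQLite 3, not the 2.8 under discussion). The trimmed-down Articles schema and the random row generator are mine, so the absolute timings mean nothing; only the comparison does:

```python
import random
import sqlite3
import string
import time

def load(index_first: bool, n: int = 50_000) -> float:
    """Load n random rows; create the index before or after the inserts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Articles "
                 "(RID INTEGER PRIMARY KEY, MessageID TEXT, DomainID INTEGER)")
    if index_first:
        conn.execute("CREATE INDEX Articles_Index "
                     "ON Articles (MessageID, DomainID)")
    rows = [(i, "".join(random.choices(string.ascii_letters, k=20)), i % 11)
            for i in range(n)]
    start = time.perf_counter()
    with conn:  # one transaction, as in the benchmarks above
        conn.executemany("INSERT INTO Articles VALUES (?, ?, ?)", rows)
        if not index_first:
            conn.execute("CREATE INDEX Articles_Index "
                         "ON Articles (MessageID, DomainID)")
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed

print(f"index first: {load(True):.2f}s, index after: {load(False):.2f}s")
```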
RE: [sqlite] FW: 2.8.14 performance, Indexed Table INSERTS
I know insert-into-an-indexed-table performance has been discussed here before, but I wanted to add to it and ask some questions. I'm benchmarking loading insert statements, dumped from a 3.7GB file, into an empty database with sqlite.exe:

BEGIN TRANSACTION;
CREATE TABLE Articles(RID integer primary key, RefCnt integer, MessageID text, DomainID integer, Date integer, Lines integer, Flags integer, PartSize integer, FileSize integer, SetSize integer, CurPart integer, MaxPart integer, CurFile integer, MaxFile integer, Subject integer, Author integer, FileName integer, DlDir integer, PrevPart integer, NextPart integer, Score integer);
CREATE INDEX Articles_Index ON Articles (MessageID, DomainID);
INSERT INTO Articles VALUES(1,3,'dXOuc.4884$pX3.1156',1,817862976,2521,1328512,317646,10499202,0,2,33,0,0,1,1,1,0,109099,109100,0);
INSERT INTO Articles VALUES(2,3,'z3Puc.4938$pX3.1969',1,817863259,2524,1328512,318024,10523016,0,22,33,0,0,1,1,2,0,109676,109701,0);
-- etc --
INSERT INTO Articles VALUES(10602742,1,'8OWdnQHIA84fFSLdRVn-uA',11,818109133,3050,1327744,384300,384300,0,1,40,23,36,84580,3412,243311,0,0,10602743,0);
INSERT INTO Articles VALUES(10602743,1,'8OWdnQDIA84aFSLdRVn-uA',11,818109135,3051,1065600,384426,384426,0,2,40,23,36,84580,3412,243311,0,10602742,0,0);
COMMIT;

The operation starts off loading data at about 500K/sec, but drops to about 50K/sec over 30 minutes and stays around there. (If it comes through, see attached.) At that point about 1.2GB of the 3.7GB has loaded. It takes about 16 hours to complete the full 3.7GB. Without the index, the whole 3.7GB loads in about 20 minutes.

CPU starts at 100% but soon starts to come down. I expect my hard drives to deliver about 50MB/sec sequentially, about 5MB/sec random access, with an 8-10 msec seek time. Memory allocation is stable, sqlite.exe is not paging, and the thread is either "running" or "waiting for executive".
Using TaskInfo2003, the only files with activity are the input sql file and the output sqlite DB file; the temp and journal files are quiet, with no movement. The file position for the sql input file grows linearly toward its end. The file position for the sqlite DB file is another story: as the DB grows, you can see the locality deteriorate by watching it. TaskInfo can give half-second samples. It starts out spending most of its time growing the file at the end. As time goes on, it starts jumping to positions back in the file. After 30 min or so, it is spending most of its time jumping around. I presume it is looking for where new records belong in the index ordering.

So, what is the limiting resource? The seek time? Can the index be configured to be kept more in memory? Can it be kept in a different file, and would this help (I suspect if and only if the file is on a different physical disk)?

TIA
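On keeping the index more in memory: SQLite's page cache size is tunable with PRAGMA cache_size, which as far as I recall exists in the 2.x line as well. A sketch with the modern sqlite3 Python module; the 100000-page figure is purely illustrative, not a recommendation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite keeps recently used B-tree pages in an in-process page cache.
# Enlarging it lets more of the index stay resident during a bulk load,
# cutting the random seeks seen in the trace above. Size it to the RAM
# you can spare; in modern SQLite, negative values mean kibibytes.
conn.execute("PRAGMA cache_size = 100000")

(size,) = conn.execute("PRAGMA cache_size").fetchone()
print(size)  # → 100000
```

Note this only applies per connection; the setting is not stored in the database file.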