My first reaction to all this was...why use SQL to do this? Then after I saw 333450 results I was wondering if that was accurate. Out of 10,000 records there should be 384.62 average name matches The odds of being < 10 with 0-99 position is approx 10000*.1 This gives us 384.62*(10000-384.62)*0.1 or 369,822 matches -- it will be a little less due to the edge conditions. I'm averaging 363,000 or so matches. That's a 10% difference. I've got 10,000 records with 'a'-'z' name and 0-99 position. So here's C++ that runs in 0.068 seconds -- about 34 times faster than the fastest query shown. #include <iostream> #include <vector> using namespace std; #define N 10000 #define POS 100 #define DX 10 struct s_datum { unsigned char name; unsigned char pos; } datum; int main (int argc, char *argv[]) { vector < struct s_datum >v; srandom (time (NULL)); for (int i = 0; i < N; i++) { datum.name = (random () % 26) + 'a'; datum.pos = random () % POS; v.push_back (datum); //cout << datum.name << "," << (int)datum.pos << endl; } int match = 0; for (int i = 0; i < 10000; i++) { for (int j = i; j < N; j++) { if (v[i].name != v[j].name) { continue; } int dx = abs (v[i].pos - v[j].pos); if (dx != 0 && dx <= DX) { match++; // cout << v[i].name << "," << (int)v[i].pos << "," << v[j].name << "," << (int)v[j].pos << "," "dx=" << dx << endl; } } } cout << match << endl; }
Michael D. Black Senior Scientist Advanced Analytics Directorate Northrop Grumman Information Systems ________________________________ From: sqlite-users-boun...@sqlite.org on behalf of ve3meo Sent: Sat 8/21/2010 4:25 PM To: sqlite-users@sqlite.org Subject: EXTERNAL:Re: [sqlite] What is the most efficient way to gettheclosebynumbers? "Simon Slavin" ... > >> Here's how I have interpreted Simon's suggested chunky query: >> >> select * from A as A1, A as A2 where A1.name=A2.name and >> A1.position != A2.position and >> A2.chunk between A1.chunk - 1 and A1.chunk + 1 and >> A2.position between A1.position - 10 and A1.position + 10 ; > > That's exactly what I meant. It allows you to reject everything not in > those three chunks immediately, before even working out what 'position - > 10' and 'position + 10' are. But looking again at your formulation with > 'between' in ... > >> select * from A as A1, A as A2 where A1.name=A2.name and >> A1.position != A2.position and >> A2.position between A1.position - 10 and A1.position + 10 ; > > I think that that might be just as fast, and not require you to work out > the chunks, or take up all the space the chunk indexes will take up, or > the time it will take to generate them. So we're back to the same old > song: Try the simpler solution. If it's fast enough, use it. Only if it > isn't fast enough is it worth optimizing. > > By the way, you might find that swapping the last two lines makes it > faster: > > >> select * from A as A1, A as A2 where A1.name=A2.name and >> A2.position between A1.position - 10 and A1.position + 10 and >> A1.position != A2.position ; > > But you might not: SQLite's optimizer may already have spotted it. I did some tests on a 10,000 row table similar to Peng's example with name a random alpha and random position 0-99, distance 10: each query returned 333,450 results. ~62s: Peng's original query with no index, either order of constraints ~4s: Peng's original query with idxNamePos, either order ~2.3s: Jim's 'between' query with idxNamePos, either order ~2.7s: Simon's 'chunky' addition to Jim's with idxNameChunkPos, any order So: 1. Simpler is best, in this case. The 'chunking' option paid a ~16% penalty in speed and added considerable complexity and storage requirement. 2. The order of the where constraints did not seem to matter - I guess the optimiser sorts it out or maybe the random nature of the data offers no advantage to any particular order. Tom _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users