Re: [sqlite] EXTERNAL:Re: What is the most efficient way to gettheclosebynumbers?

Black, Michael (IS) Sun, 22 Aug 2010 06:30:37 -0700

My first reaction to all this was...why use SQL to do this?
Then after I saw 333450 results I was wondering if that was accurate.
 
Out of 10,000 records there should be 384.62 average name matches
The odds of being < 10 with 0-99 position is approx 10000*.1
This gives us 384.62*(10000-384.62)*0.1 or 369,822 matches -- it will be a 
little less due to the edge conditions.
 
I'm averaging 363,000 or so matches.  That's a 10% difference.
I've got 10,000 records with 'a'-'z' name and 0-99 position.
 
So here's C++ that runs in 0.068 seconds -- about 34 times faster than the 
fastest query shown.
 
#include <iostream>
#include <vector>
using namespace std;
#define N 10000
#define POS 100
#define DX 10
struct s_datum
{
  unsigned char name;
  unsigned char pos;
} datum;
int
main (int argc, char *argv[])
{
  vector < struct s_datum >v;
  srandom (time (NULL));
  for (int i = 0; i < N; i++) {
    datum.name = (random () % 26) + 'a';
    datum.pos = random () % POS;
    v.push_back (datum);
    //cout << datum.name << "," << (int)datum.pos << endl;
  }
  int match = 0;
  for (int i = 0; i < 10000; i++) {
    for (int j = i; j < N; j++) {
      if (v[i].name != v[j].name) {
        continue;
      }
      int dx = abs (v[i].pos - v[j].pos);
      if (dx != 0 && dx <= DX) {
        match++;
        //              cout << v[i].name << "," << (int)v[i].pos << "," << 
v[j].name << "," << (int)v[j].pos << "," "dx=" << dx << endl;
      }
    }
  }
  cout << match << endl;
}


 
 
Michael D. Black
Senior Scientist
Advanced Analytics Directorate
Northrop Grumman Information Systems
 

________________________________

From: sqlite-users-boun...@sqlite.org on behalf of ve3meo
Sent: Sat 8/21/2010 4:25 PM
To: sqlite-users@sqlite.org
Subject: EXTERNAL:Re: [sqlite] What is the most efficient way to 
gettheclosebynumbers?



"Simon Slavin" ...
>
>> Here's how I have interpreted Simon's suggested chunky query:
>>
>> select * from A as A1, A as A2 where A1.name=A2.name and
>> A1.position != A2.position and
>> A2.chunk between A1.chunk - 1 and A1.chunk + 1 and
>> A2.position between A1.position - 10 and A1.position + 10 ;
>
> That's exactly what I meant.  It allows you to reject everything not in
> those three chunks immediately, before even working out what 'position -
> 10' and 'position + 10' are.  But looking again at your formulation with
> 'between' in ...
>
>> select * from A as A1, A as A2 where A1.name=A2.name and
>> A1.position != A2.position and
>> A2.position between A1.position - 10 and A1.position + 10 ;
>
> I think that that might be just as fast, and not require you to work out
> the chunks, or take up all the space the chunk indexes will take up, or
> the time it will take to generate them.  So we're back to the same old
> song: Try the simpler solution.  If it's fast enough, use it.  Only if it
> isn't fast enough is it worth optimizing.
>
> By the way, you might find that swapping the last two lines makes it
> faster:
>
>
>> select * from A as A1, A as A2 where A1.name=A2.name and
>> A2.position between A1.position - 10 and A1.position + 10 and
>> A1.position != A2.position ;
>
> But you might not: SQLite's optimizer may already have spotted it.

I did some tests on a 10,000 row table similar to Peng's example with name a
random alpha and random position 0-99, distance 10: each query returned
333,450 results.

~62s: Peng's original query with no index, either order of constraints
~4s: Peng's original query with idxNamePos, either order
~2.3s: Jim's 'between' query with idxNamePos, either order
~2.7s: Simon's 'chunky' addition to Jim's with idxNameChunkPos, any order

So:
1. Simpler is best, in this case. The 'chunking' option paid a ~16% penalty
in speed and added considerable complexity and storage requirement.
2. The order of the where constraints did not seem to matter - I guess the
optimiser sorts it out or maybe the random nature of the data offers no
advantage to any particular order.

Tom



_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] EXTERNAL:Re: What is the most efficient way to gettheclosebynumbers?

Reply via email to