Re: [sqlite] Is it possible to optimize this query on a very large database table

2011-10-13 Thread Frank Chang

Igor Tandetnik,
 
>> How come you only show one function? A user-defined aggregate function is 
>> actually represented by two C[++] functions - one that is called for every 
>> row 
>> and performs actual aggregation, and another that's called at the end of 
>> each 
>> group, reports the result and resets the state machine to prepare for the 
>> next 
>> group. You can use sqlite3_context to store state between invocations - see 
>> sqlite3_aggregate_context.

 We have defined 2 C++ function XStep and  XFinalize(shown below). 
The group by last name BLOB results look accurate. Thank you for your help.
 
void cIntersectingGroupCache::XFinalize(sqlite3_context *context){
 listCtx *p;  
 char *buf=NULL;

 buf = (char *) malloc ((sizeof(int) * 
((cIntersectingGroupCache*)(p->TheThis))->Column2.size())+ 4);
 if (buf == NULL)
  printf("malloc error in XFinalize, buf\n");
 
 sqlite3_result_blob(context,buf,
 (((cIntersectingGroupCache*)(p->TheThis))->Column2.size()*sizeof(int)) 
+ 4, free);
 
 ((cIntersectingGroupCache*)(p->TheThis))->Column2.clear();





}



 
 

  
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Is it possible to optimize this query on a very large database table

2011-10-11 Thread Igor Tandetnik
Frank Chang  wrote:
>The explain query plan for select FieldName, min(rowid) from 
> BlobLastNameTest group by FieldName shows a full index scan.

Of course. How else do you expect to be able to look at every FieldName?

You seem to be expecting some kind of a "compressed" index - an index with as 
many entried as there are distinct values of FieldName, each entry pointing 
to... I'm not sure what, exactly. There ain't no such thing. An index on 
FieldName still has as many entries as there are rows in the underlying table - 
it's just sorted by FieldName. To get a list of all distinct values of 
FieldName, SQLite has to scan this index, and simply discard any value that is 
equal to that from previous row.

If you think you need such a "compressed index", you would have to maintain it 
yourself, as a separate table. Personally, based on the description of your 
problem, I don't think you need any such thing.

>  Here is how I might do the update:
> 
>   1. CREATE TABLE FOO(FIELDNAME CHAR(25), IDROW INT);
>   2  INSERT INTO FOO select FieldName, min(rowid) from BlobLastNameTest 
> group by FieldName
>   3. INSERT OR REPLACE INTO BLOBLASTNAMETEST SELECT t1.FIELDNAME, 
> UDF(t1.ROWID,t1.FIELDNAME,this,'BLOBLASTNAMETEST')  FROM
> FOO WHERE BLOBLASTNAMETEST.FIELDNAME = FOO.FIELDNAME AND 
> BLOBLASTNAMETEST.ROWID = FOO.IDROW. 

This last query makes no sense to me. It refers to BLOBLASTNAMETEST in WHERE 
clause though it was never mentioned in FROM clause. It refers to identifiers 
"t1" and "this" that were never declared.

How about this? Drop steps 1 and 2, and run this statement instead:

insert or replace into BlobLastNameTest(rowid, FieldName, Vertices)
select min(rowid), FieldName, MyAccumulation(Vertices)
from BlobLastNameTest
group by FieldName;

where MyAccumulation is a custom aggregate function that does whatever you mean 
by "accumulate". This way, you should be able to do everything in a single 
pass. See http://sqlite.org/c3ref/create_function.html , the description of 
xStep and xFinal parameters, for an explanation of how to set up a custom 
aggregate function.
-- 
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Is it possible to optimize this query on a very large database table

2011-10-11 Thread Frank Chang

Igor Tandetnik, 
 
The explain query plan for select FieldName, min(rowid) from 
BlobLastNameTest group by FieldName shows a full index scan, even after I run 
ANALYZE TABLE AND INDEX ON FIELDNAME, which could be problematic when the 
number of rows in BlobLastNameTest increases by a factor 10 to 30 million rows. 
But since Sqlite does not have a DISTINCT ON clause as Florian Weimer as 
pointed out two days ago,we will have to investigate this issue in another 
possible thread.
 
  Here is how I might do the update:
 
   1. CREATE TABLE FOO(FIELDNAME CHAR(25), IDROW INT);
   2  INSERT INTO FOO select FieldName, min(rowid) from BlobLastNameTest 
group by FieldName 
   3. INSERT OR REPLACE INTO BLOBLASTNAMETEST SELECT t1.FIELDNAME, 
UDF(t1.ROWID,t1.FIELDNAME,this,'BLOBLASTNAMETEST')  FROM FOO WHERE 
BLOBLASTNAMETEST.FIELDNAME = FOO.FIELDNAME AND BLOBLASTNAMETEST.ROWID = 
FOO.IDROW.
 
WHERE THE UDF looks like this:
 

void cIntersectingGroupCache::UDFFunc(sqlite3_context *context, int argc, 
sqlite3_value **argv){
   char* TableName;
   int size; 
   int* ip2;
   long long int iVal;
 
   ((cIntersectingGroupCache*)(sqlite3_value_int(argv[2])))->Column.clear();
 
   switch( sqlite3_value_type(argv[0]) ){
  case SQLITE_INTEGER: {
 iVal = sqlite3_value_int64(argv[0]);
 iVal = ( iVal > 0) ? 1: ( iVal < 0 ) ? -1: 0;
 sqlite3_result_int64(context, iVal);
 break;
 }
 case SQLITE_NULL: {
size = sqlite3_value_bytes(argv[3]);
TableName = new char[size + 1];
memcpy(TableName, sqlite3_value_text(argv[3]),size);
TableName[size] = '\x0';

((cIntersectingGroupCache*)(sqlite3_value_int(argv[2])))->RowCountMap[TableName]
 += 1;
delete [] TableName;
break;
   }
   case SQLITE_BLOB: {
  size = sqlite3_value_bytes(argv[0]);
  ip2 = (int *)sqlite3_value_blob(argv[0]);
  for (int i = 0; i < size/sizeof(int); i++){
 
((cIntersectingGroupCache*)(sqlite3_value_int(argv[2])))->Column.push_back(ip2[i]);
 }
 break;
  }
  default: {
break;
  }
}
 
switch( sqlite3_value_type(argv[1]) ){
   case SQLITE_INTEGER: {
int iVal = sqlite3_value_int(argv[1]);

((cIntersectingGroupCache*)(sqlite3_value_int(argv[2])))->Column.push_back(iVal);
char* blob = 
reinterpret_cast(&(((cIntersectingGroupCache*)(sqlite3_value_int(argv[2])))->Column[0]));

sqlite3_result_blob(context,blob,((cIntersectingGroupCache*)(sqlite3_value_int(argv[2])))->Column.size()*sizeof(int),NULL);
   break;
 }
default: {
   break;
}
}
}
 
Thank you for your help.
  
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Is it possible to optimize this query on a very large database table Oct. 10, 2011 13:53:01 GMT

2011-10-10 Thread Frank Chang

 Igor Tandetnik,
 
>>> Explain the problem you are trying to solve, *not* your proposed solution. 
>>> <<<

 
  What we are trying to achieve is to to find the minimum row id for each 
unique Field Name in BLobLastNameTest where many rows can have the same 
FIELDNAME but distinct BLOBS(Vertices Column).
 
   Once we know  the first/minimum row id  of each unique fieldname group, 
we would like to write a sqlite UPDATE Statement to accumulate all the BLOB 
vertices of all the rows with that unique fieldname into the first(i.e 
MIN(ROWID))  row's BLOB(Vertices)column  for  each unique fieldname group.
 
The reason we can't use select FieldName, rowid from BlobLastNameTest is 
that it would slow our C++ Windows/LINUx/Solaris UNIX worker threads so much 
that it wouldn;t be worth multithreading this whole process. We were hoping 
that the SQLITE query processor and the appropriate indexes could accomplish 
these previous 2 steps in a few minutes for a reasonable number of 
BLOBLASTNAMETEST rows. Thank you for all of your help. 
  
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Is it possible to optimize this query on a very large database table.

2011-10-10 Thread Igor Tandetnik
Frank Chang  wrote:
> Igor Tandetnik,
>  The fieldname groups in our BlobLastNameTable consist of 
> multiple rows where each pair of columns [FieldName,
>   BLOB[Vertices]] is unique.

How so? You have FieldName declared as PRIMARY KEY. From your original post:

CREATE TABLE [BlobLastNameTest] ([FieldName] CHAR (25) PRIMARY KEY, [Vertices] 
BLOB )

PRIMARY KEY means there can only be one row with any given value of FieldName.

If you have since changed your schema, then post the new CREATE TABLE statement 
for BlobLastNameTest table, and any CREATE INDEX statements related to it.

> Therefore, every fieldname group does not just have a single row but instead 
> 1000's or
>  1's rows. So that is why we use a group by/order by and 
> subselect clause to locate the first/minimum row id  row
> in each fieldname group.

Assuming this is true, what's wrong with

select FieldName, min(rowid) from BlobLastNameTest group by FieldName;

> Once we know  the first/minimum row id  of each unique fieldname group, we 
> would lke to write a
> sqlite UPDATE Statement to accumulate all the BLOB vertices of all the rows 
> with that unique fieldname into the first(i.e
> MIN(ROWID))  row's BLOB(Vertices)column  for  each unique fieldname group.

What do you mean by "accumulate"? Concatenate? I don't think you can do that 
with SQL alone - you'll have to write some code. Personally, I'd do something 
like this (in pseudocode):

stmt = prepare("select FieldName, rowid, Vertices from BlobLastNameTest order 
by FieldName, rowid")
currentFieldName = "";
firstRowId = -1;
blob = ""
while (stmt.Step) {
  if (currentFieldName != stmt.FieldName) {
commitBlob(firstRowId, blob)
currentFieldName = stmt.FieldName
firstRowId = stmt.rowid
blob = ""
  }
  blob += stmt.Vertices  // whatever you mean by "accumulate", do it here
}
commitBlob(firstRowId, blob)

function commitBlob(rowid, blob) {
  if (rowid > 0) {
execute "update BlobLastNameTest set Vertices = ? where rowid = ?;"
with parameters (blob, rowid)
  }
}

> Then we would like to  discard all the rows  in each
> fieldname group of rows that have an rowid different from the first row

That one's easy:

delete from BlobLastNameTest where rowid !=
(select min(rowid) from BlobLastNameTest t2
 where t2.FieldName = BlobLastNameTest.FieldName);

-- 
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Is it possible to optimize this query on a very large database table.

2011-10-10 Thread Frank Chang

Igor Tandetnik, 
  The fieldname groups in our BlobLastNameTable consist of 
multiple rows where each pair of columns [FieldName, BLOB[Vertices]] is unique. 
Therefore, every fieldname group does not just have a single row but instead 
1000's or 1's rows. So that is why we use a group by/order by and subselect 
clause to locate the first/minimum row id  row in each fieldname group. 
   Once we know  the first/minimum row id  of each unique fieldname 
group, we would lke to write a sqlite UPDATE Statement to accumulate all the 
BLOB vertices of all the rows with that unique fieldname into the first(i.e 
MIN(ROWID))  row's BLOB(Vertices)column  for  each unique fieldname group. Then 
we would like to  discard all the rows  in each fieldname group of rows that 
have an rowid different from the first row(i.e MIN(rowid)).
  Because we using a C++ WINDOWS/LINUX/Solaris UNIX multithreaded 
program where each concurrent worker thread has its own sqlite database and 
sqlite table and sqlite index , we would like each concurrence worker thread to 
run as fast as possible on a multicore CPU. We have profiled the worker threads 
and we have found that the sqlite statements are the bottleneck. So, that is 
why we would like our queries to run as fast as possible by avoiding full index 
scans as Florian Weimer pointed out in the post about Postgres' DISTINCT ON 
feature which we are trying to simulate on Sqlite. 
 I hope I have provided you more information. I did try your 
suggestion:   select FieldName, rowid from BlobLastNameTest. However, it 
generates 5.5 million rows of output which would force our C++ Worker threads 
to do the BLOB(Vertices) aggregation. Unfortunately , that would would take too 
long so we were hoping that we could devise a nice query to let the SQLite 
query processor do all the Blob(Vertices) aggregation in a few minutes or less 
. Please let me know if you have any questions or suggestions. Thank you for 
your help.  
-- 
  
 
 
--
I take it back - the behavior of your query is well-defined, for the simple 
reason that FieldName is unique,
so every group only has a single row in it (as someone else has kindly pointed 
out - I missed this detail on the
first reading). For that same reason, the whole GROUP BY and sub-select dance 
is completely pointless.
Your query is just a very elaborate and wasteful way to write

select FieldName, rowid from BlobLastNameTest; 


  
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Is it possible to optimize a query on a very large database table?

2011-10-10 Thread Igor Tandetnik
Frank Chang  wrote:
> Florian Weimar and  Igor Tadetnik,
> 
> When I replace the GROUP BY t1.FIELDNAME with ORDER BY 1,
> 
> select t1.FieldName,t1.rowid from BlobLastNameTest t1 where t1.r
> owid = (SELECT MIN(rowid) FROM BlobLastNameTest where FieldName = 
> t1.FIELDNAME)
> order by 1;

This query doesn't make any sense, no more than the one with GROUP BY did. 
Let's step back for a minute - what exactly are you trying to achieve? In what 
way does this simple query fail to reach your goal, whatever that may be:

select FieldName, rowid from BlobLastNameTest;

Explain the problem you are trying to solve, *not* your proposed solution.
-- 
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Is it possible to optimize a query on a very large database table?

2011-10-10 Thread Frank Chang

Florian Weimar and  Igor Tadetnik,
 
 When I replace the GROUP BY t1.FIELDNAME with ORDER BY 1, 
 
select t1.FieldName,t1.rowid from BlobLastNameTest t1 where t1.r
owid = (SELECT MIN(rowid) FROM BlobLastNameTest where FieldName = t1.FIELDNAME)
order by 1;


 
the explain output seems to have 40% less steps. Does this mean order by 1 
should be faster than group by t1.FIELDNAME as I scale up the number of rows in 
the very large database table BLOBLASTNAMETEST. The explain output is ahown 
below. Thank you. 
 
 
sqlite> explain select t1.FieldName,t1.rowid from BlobLastNameTest t1 where t1.r
owid = (SELECT MIN(rowid) FROM BlobLastNameTest where FieldName = t1.FIELDNAME)
order by 1;
0|Trace|0|0|0||00|
1|Noop|0|0|0||00|
2|Goto|0|32|0||00|
3|OpenRead|3|108789|0|keyinfo(1,BINARY)|00|
4|Rewind|3|30|1|0|00|
5|IdxRowid|3|1|0||00|
6|Null|0|3|0||00|
7|Integer|1|4|0||00|
8|Null|0|6|0||00|
9|Null|0|5|0||00|
10|OpenRead|4|108789|0|keyinfo(1,BINARY)|00|
11|Column|3|0|7||00|
12|IsNull|7|20|0||00|
13|SeekGe|4|20|7|1|00|
14|IdxGE|4|20|7|1|01|
15|IdxRowid|4|9|0||00|
16|CollSeq|0|0|0|collseq(BINARY)|00|
17|AggStep|0|9|5|min(1)|01|
18|Goto|0|20|0||00|
19|Next|4|14|0||00|
20|Close|4|0|0||00|
21|AggFinal|5|1|0|min(1)|00|
22|SCopy|5|10|0||00|
23|Move|10|3|1||00|
24|IfZero|4|25|-1||00|
25|Ne|3|29|1||6c|
26|Column|3|0|11||00|
27|IdxRowid|3|12|0||00|
28|ResultRow|11|2|0||00|
29|Next|3|5|0||00|
30|Close|3|0|0||00|
31|Halt|0|0|0||00|
32|Transaction|0|0|0||00|
33|VerifyCookie|0|7|0||00|
34|TableLock|0|2|0|BlobLastNameTest|00|
35|Goto|0|3|0||00|
sqlite>
 
 
 
 
 
sqlite> explain select t1.FieldName,t1.rowid from BlobLastNameTest t1 where t1.r
owid = (SELECT MIN(rowid) FROM BlobLastNameTest where FieldName = t1.FIELDNAME)
GROUP by t1.FieldName;
0|Trace|0|0|0||00|
1|Noop|0|0|0||00|
2|Integer|0|4|0||00|
3|Integer|0|3|0||00|
4|Gosub|6|51|0||00|
5|Goto|0|55|0||00|
6|OpenRead|3|108789|0|keyinfo(1,BINARY)|00|
7|Rewind|3|40|9|0|00|
8|IdxRowid|3|9|0||00|
9|Null|0|11|0||00|
10|Integer|1|12|0||00|
11|Null|0|14|0||00|
12|Null|0|13|0||00|
13|OpenRead|4|108789|0|keyinfo(1,BINARY)|00|
14|Column|3|0|15||00|
15|IsNull|15|23|0||00|
16|SeekGe|4|23|15|1|00|
17|IdxGE|4|23|15|1|01|
18|IdxRowid|4|17|0||00|
19|CollSeq|0|0|0|collseq(BINARY)|00|
20|AggStep|0|17|13|min(1)|01|
21|Goto|0|23|0||00|
22|Next|4|17|0||00|
23|Close|4|0|0||00|
24|AggFinal|13|1|0|min(1)|00|
25|SCopy|13|18|0||00|
26|Move|18|11|1||00|
27|IfZero|12|28|-1||00|
28|Ne|11|39|9||6c|
29|Column|3|0|8||00|
30|Compare|7|8|1|keyinfo(1,BINARY)|00|
31|Jump|32|36|32||00|
32|Move|8|7|1||00|
33|Gosub|5|45|0||00|
34|IfPos|4|54|0||00|
35|Gosub|6|51|0||00|
36|Column|3|0|1||00|
37|IdxRowid|3|2|0||00|
38|Integer|1|3|0||00|
39|Next|3|8|0||00|
40|Close|3|0|0||00|
41|Gosub|5|45|0||00|
42|Goto|0|54|0||00|
43|Integer|1|4|0||00|
44|Return|5|0|0||00|
45|IfPos|3|47|0||00|
46|Return|5|0|0||00|
47|Copy|1|19|0||00|
48|Copy|2|20|0||00|
49|ResultRow|19|2|0||00|
50|Return|5|0|0||00|
51|Null|0|1|0||00|
52|Null|0|2|0||00|
53|Return|6|0|0||00|
54|Halt|0|0|0||00|
55|Transaction|0|0|0||00|
56|VerifyCookie|0|7|0||00|
57|TableLock|0|2|0|BlobLastNameTest|00|
58|Goto|0|6|0||00|
sqlite>   
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users