Re: [sqlite] Determine if an index has been created

2013-01-31 Thread Simon Slavin

On 31 Jan 2013, at 10:57pm, Igor Tandetnik  wrote:

> On 1/31/2013 5:45 PM, Paul Sanderson wrote:
>> Is it possible to ascertain if an index on a particular column has already
>> been created.
> 
> PRAGMA index_list(YourTable), then for each index, PRAGMA 
> index_info(IndexName)

But note that one index can be on any number of columns.  In fact if you have 
long SELECT statements you probably have long indexes too.  So do you want to 
know if you have an index on just that one column, or if you have any indexes 
which include that column ?

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Determine if an index has been created

2013-01-31 Thread Igor Tandetnik

On 1/31/2013 5:45 PM, Paul Sanderson wrote:

Is it possible to ascertain if an index on a particular column has already
been created.


PRAGMA index_list(YourTable), then for each index, PRAGMA 
index_info(IndexName)

--
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] (no subject)

2013-01-31 Thread Paul Sanderson
This makes a huge difference in speed thanks


>>Also, you can write the query a bit more compactly:

>>select fileref from rtable where vsc > 0 and isgraphic = 1 and md5 not in
>>(select md5 from rtable where isgraphic = 1 and vsc = 0);


On 31 January 2013 19:54, Igor Tandetnik  wrote:

> On 1/31/2013 2:33 PM, Paul Sanderson wrote:
>
>> My query is
>>
>> select fileref from rtable as r where vsc > 0 and isgraphic = 1 and not
>> exists (select md5 fr
>> om rtable as r1 where r.md5 = r1.md5 and isgraphic = 1 and vsc = 0);
>>
>> explain query plan and explain have been run on the table with the results
>> below. Any ideas where and how I can improve performance?
>>
>
> An index on rtable(md5) should help. If you do have one, and it doesn't
> get picked, try changing "isgraphic = 1" to "+isgraphic = 1" in the inner
> select (the unary plus operator makes the index on isgraphic inapplicable,
> hopefully steering the query planner towards a more helpful index).
>
> Also, you can write the query a bit more compactly:
>
> select fileref from rtable where vsc > 0 and isgraphic = 1 and md5 not in
> (select md5 from rtable where isgraphic = 1 and vsc = 0);
>
> --
> Igor Tandetnik
>
>
> __**_
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-**bin/mailman/listinfo/sqlite-**users
>



-- 
Paul
www.sandersonforensics.com
skype: r3scue193
twitter: @sandersonforens
Tel +44 (0)1326 572786
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] SQL query

2013-01-31 Thread Marc L. Allen
Actually... with that requirement, I wonder if it's even easier/better to use:

Select name, min(setid), hash
>From rtable 
Group by name, hash
Having min(setid) > 0

-Original Message-
From: sqlite-users-boun...@sqlite.org [mailto:sqlite-users-boun...@sqlite.org] 
On Behalf Of Paul Sanderson
Sent: Thursday, January 31, 2013 5:08 PM
To: General Discussion of SQLite Database
Subject: Re: [sqlite] SQL query

Cool that seems to work - thanks
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] SQL query

2013-01-31 Thread Marc L. Allen
Add a group by name, hash and change the select to be name, min(setid), hash?

-Original Message-
From: sqlite-users-boun...@sqlite.org [mailto:sqlite-users-boun...@sqlite.org] 
On Behalf Of Paul Sanderson
Sent: Thursday, January 31, 2013 4:48 PM
To: General Discussion of SQLite Database
Subject: Re: [sqlite] SQL query

Still playing with this

I have the following table and I run the following query - the results of which 
are what I expect

name, num, md5

sqlite> select * from rtable;
$RmMetadata|0|8465-CEEF-126A-0F04-1EDC-1D7B-331F-9279
$RmMetadata|1|8465-CEEF-126A-0F04-1EDC-1D7B-331F-9279
$RmMetadata|2|8465-CEEF-126A-0F04-1EDC-1D7B-331F-9279
$RmMetadata|3|B2F9-B5D4-A427-9FE2-9724-BF95-1571-7CE5
$RmMetadata|4|0546-4667-5A69-6478-FC97-6F27-840D-7D62
$RmMetadata|5|0546-4667-5A69-6478-FC97-6F27-840D-7D62
$RmMetadata|6|0546-4667-5A69-6478-FC97-6F27-840D-7D62
$RmMetadata|7|01EE-7E2E-2242-E734-B125-D02F-A7F0-DC29
$RmMetadata|8|01EE-7E2E-2242-E734-B125-D02F-A7F0-DC29

sqlite> select * from rtable where num > 0 and md5 not in (select md5 
sqlite> from
rtable where num = 0);
$RmMetadata|3|B2F9-B5D4-A427-9FE2-9724-BF95-1571-7CE5
$RmMetadata|4|0546-4667-5A69-6478-FC97-6F27-840D-7D62
$RmMetadata|5|0546-4667-5A69-6478-FC97-6F27-840D-7D62
$RmMetadata|6|0546-4667-5A69-6478-FC97-6F27-840D-7D62
$RmMetadata|7|01EE-7E2E-2242-E734-B125-D02F-A7F0-DC29
$RmMetadata|8|01EE-7E2E-2242-E734-B125-D02F-A7F0-DC29

What I want to do is return only one unique row for each MD5 - so the return 
set would be, this bit is failing me

$RmMetadata|3|B2F9-B5D4-A427-9FE2-9724-BF95-1571-7CE5
$RmMetadata|4|0546-4667-5A69-6478-FC97-6F27-840D-7D62
$RmMetadata|7|01EE-7E2E-2242-E734-B125-D02F-A7F0-DC29

Thanks
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] SQL query

2013-01-31 Thread Paul Sanderson
Still playing with this

I have the following table and I run the following query - the results of
which are what I expect

name, num, md5

sqlite> select * from rtable;
$RmMetadata|0|8465-CEEF-126A-0F04-1EDC-1D7B-331F-9279
$RmMetadata|1|8465-CEEF-126A-0F04-1EDC-1D7B-331F-9279
$RmMetadata|2|8465-CEEF-126A-0F04-1EDC-1D7B-331F-9279
$RmMetadata|3|B2F9-B5D4-A427-9FE2-9724-BF95-1571-7CE5
$RmMetadata|4|0546-4667-5A69-6478-FC97-6F27-840D-7D62
$RmMetadata|5|0546-4667-5A69-6478-FC97-6F27-840D-7D62
$RmMetadata|6|0546-4667-5A69-6478-FC97-6F27-840D-7D62
$RmMetadata|7|01EE-7E2E-2242-E734-B125-D02F-A7F0-DC29
$RmMetadata|8|01EE-7E2E-2242-E734-B125-D02F-A7F0-DC29

sqlite> select * from rtable where num > 0 and md5 not in (select md5 from
rtable where num = 0);
$RmMetadata|3|B2F9-B5D4-A427-9FE2-9724-BF95-1571-7CE5
$RmMetadata|4|0546-4667-5A69-6478-FC97-6F27-840D-7D62
$RmMetadata|5|0546-4667-5A69-6478-FC97-6F27-840D-7D62
$RmMetadata|6|0546-4667-5A69-6478-FC97-6F27-840D-7D62
$RmMetadata|7|01EE-7E2E-2242-E734-B125-D02F-A7F0-DC29
$RmMetadata|8|01EE-7E2E-2242-E734-B125-D02F-A7F0-DC29

What I want to do is return only one unique row for each MD5 - so the
return set would be, this bit is failing me

$RmMetadata|3|B2F9-B5D4-A427-9FE2-9724-BF95-1571-7CE5
$RmMetadata|4|0546-4667-5A69-6478-FC97-6F27-840D-7D62
$RmMetadata|7|01EE-7E2E-2242-E734-B125-D02F-A7F0-DC29

Thanks
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] FTS4 languageid : not sure I understand this correctly

2013-01-31 Thread Gert Van Assche
thanks for this link. Most of these tools I tested in the past, but I was
hoping there was an easier path. The best langID tool is not in this list,
by the way. that is:
https://code.google.com/p/chromium-compact-language-detector/

gert

2013/1/31 Petite Abeille 

>
> On Jan 31, 2013, at 9:27 PM, Gert Van Assche  wrote:
>
> > Thanks Michael. Not what I hoped for but now I understand it.
>
> Perhaps of interest:
>
> Language Identification Tools
> http://www.let.rug.nl/~vannoord/TextCat/competitors.html
>
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] (no subject)

2013-01-31 Thread Simon Slavin

On 31 Jan 2013, at 8:58pm, Simon Slavin  wrote:

> Depending on how chunky the values are in each column, a good index for this 
> would be an index on (md5,isgraphic,vsc).

Sorry, that should be one index on (isgraphic,vsc), I think.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] (no subject)

2013-01-31 Thread Simon Slavin

On 31 Jan 2013, at 8:48pm, Paul Sanderson  wrote:

> Thanks all
> 
> All columns in the query are indexed.

That does not do you much good.  Each SELECT can use only one index at a time.  
So if you have one index per column the query uses an index on, say, isgraphic, 
then it will have to search every value in vsc itself.

The way to create indexes is to figure out one index that's useful for each 
SELECT.  Here's the command you want to speed up:

> My query is
> 
> select fileref from rtable as r where vsc > 0 and isgraphic = 1 and not
> exists (select md5 fr
> om rtable as r1 where r.md5 = r1.md5 and isgraphic = 1 and vsc = 0);

Depending on how chunky the values are in each column, a good index for this 
would be an index on (md5,isgraphic,vsc).  Try creating this index and see if 
it helps.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] FTS4 languageid : not sure I understand this correctly

2013-01-31 Thread Petite Abeille

On Jan 31, 2013, at 9:27 PM, Gert Van Assche  wrote:

> Thanks Michael. Not what I hoped for but now I understand it.

Perhaps of interest:

Language Identification Tools
http://www.let.rug.nl/~vannoord/TextCat/competitors.html

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] FTS4 languageid : not sure I understand this correctly

2013-01-31 Thread Gert Van Assche
Thanks Michael. Not what I hoped for but now I understand it.

2013/1/31 Michael Black 

> According to the docs:
> http://www.sqlite.org/fts3.html#section_6_3
>
> It's YOUR choice as to what to put in there.  A separate index is created
> for each language id.
> So it's designed to prevent cross-lang contamination in the index.
>
> -Original Message-
> From: sqlite-users-boun...@sqlite.org
> [mailto:sqlite-users-boun...@sqlite.org] On Behalf Of Gert Van Assche
> Sent: Thursday, January 31, 2013 2:19 PM
> To: sqlite-users
> Subject: [sqlite] FTS4 languageid : not sure I understand this correctly
>
> All,
>
> I have the feeling this is the most stupid question ever, but...
> If I create a FTS4 table, put text in it, could I use the languageid to
> figure out what Language that text actually is?
> Is that how langID works?
> I did some tests, but the LangID seems to be 0 all the time, so or I'm
> doing something wrong, or I misunderstand the process.
>
> thanks
>
> gert
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] FTS4 languageid : not sure I understand this correctly

2013-01-31 Thread Gert Van Assche
All,

I have the feeling this is the most stupid question ever, but...
If I create a FTS4 table, put text in it, could I use the languageid to
figure out what Language that text actually is?
Is that how langID works?
I did some tests, but the LangID seems to be 0 all the time, so or I'm
doing something wrong, or I misunderstand the process.

thanks

gert
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] (no subject)

2013-01-31 Thread Igor Tandetnik

On 1/31/2013 2:33 PM, Paul Sanderson wrote:

My query is

select fileref from rtable as r where vsc > 0 and isgraphic = 1 and not
exists (select md5 fr
om rtable as r1 where r.md5 = r1.md5 and isgraphic = 1 and vsc = 0);

explain query plan and explain have been run on the table with the results
below. Any ideas where and how I can improve performance?


An index on rtable(md5) should help. If you do have one, and it doesn't 
get picked, try changing "isgraphic = 1" to "+isgraphic = 1" in the 
inner select (the unary plus operator makes the index on isgraphic 
inapplicable, hopefully steering the query planner towards a more 
helpful index).


Also, you can write the query a bit more compactly:

select fileref from rtable where vsc > 0 and isgraphic = 1 and md5 not in
(select md5 from rtable where isgraphic = 1 and vsc = 0);

--
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] (no subject)

2013-01-31 Thread Simon Slavin

On 31 Jan 2013, at 7:33pm, Paul Sanderson  wrote:

> explain query plan and explain have been run on the table with the results
> below. Any ideas where and how I can improve performance?

Can you show us the indexes you've created on rtable ?

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unable to use table alias inside parenthesis on right side of join

2013-01-31 Thread Ryan Johnson

Looks like exactly the same thing, yes. No reply, unfortunately...

On 31/01/2013 1:05 PM, Kevin Benson wrote:

I wonder if this earlier mention is related?

http://osdir.com/ml/sqlite-users/2012-07/msg00054.html
--
--
   --
  --Ô¿Ô--
 K e V i N


On Thu, Jan 31, 2013 at 12:04 PM, Ryan Johnson
wrote:


Hi all,

Strange one here... consider the following schema:

R(a,b)
S(b,c)

These queries all work fine:

select R1.* from R R1 join S S1 using(b);
select S1.* from R R1 join S S1 using(b);
select R1.* from (R R1 join S S1 using(b));
select S1.* from (R R1 join S S1 using(b));
select R1.* from (R R1 join S1 using(b)) join (R R2 join S2 using(b))
using (a,c);
select S1.* from (R R1 join S1 using(b)) join (R R2 join S2 using(b))
using (a,c);

But these two do not:

select R2.* from (R R1 join S S1 using(b)) join (R R2 join S S2 using(b))
using (a,c);


Error: no such table: R2


select S2.* from (R R1 join S S1 using(b)) join (R R2 join S S2 using(b))
using (a,c);


Error: no such table: S2


Why are R1/S1 aliases visible at top level but R2/S2 aren't? It seems that
if the parentheses made those joined pairs into nested queries, then R1/S1
should be lost; if parentheses are only for grouping, then R2/S2 should be
visible.  Either way, something seems a bit off.

The official syntax diagrams [1] suggest that parentheses imply only
grouping in this context. Then again, they also suggest it should be a
syntax error for a table alias to follow a join source in parentheses...
and that actually works fine, other than making the inner table aliases
unavailable (which kind of makes sense):

select J1.* from (R R1 join S S1 using(b)) J1 join (R R2 join S S2
using(b)) J2 using (a,c);
select J2.* from (R R1 join S S1 using(b)) J1 join (R R2 join S S2
using(b)) J2 using (a,c);
select R1.* from (R R1 join S S1 using(b)) J1 join (R R2 join S S2
using(b)) J2 using (a,c);


Error: no such table: R1


And yes, I've since ditched the USING syntax and gone back to WHERE
clauses, though the resulting is 20% longer and arguably much less clear
[2]:

select R1.*, R2.* from (R R1 join S using(b)) join (R R2 join S using(b))
using(a,c) where R1.b != R2.b
select R1.*, R2.* from R R1, R R2, S S1, S S2 where R1.b=S1.b and
R2.b=S2.b and R1.a=R2.a and S1.c=S2.c and R1.b != R2.b

[1] 
http://www.sqlite.org/**syntaxdiagrams.html#join-**source

[2] Note that I added a predicate in, which I had stripped from the
earlier test cases for clarity

Thoughts?
Ryan

__**_
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-**bin/mailman/listinfo/sqlite-**users


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unable to use table alias inside parenthesis on right side of join

2013-01-31 Thread Kevin Benson
I wonder if this earlier mention is related?

http://osdir.com/ml/sqlite-users/2012-07/msg00054.html
--
   --
  --
 --Ô¿Ô--
K e V i N


On Thu, Jan 31, 2013 at 12:04 PM, Ryan Johnson
wrote:

> Hi all,
>
> Strange one here... consider the following schema:
>
> R(a,b)
> S(b,c)
>
> These queries all work fine:
>
> select R1.* from R R1 join S S1 using(b);
> select S1.* from R R1 join S S1 using(b);
> select R1.* from (R R1 join S S1 using(b));
> select S1.* from (R R1 join S S1 using(b));
> select R1.* from (R R1 join S1 using(b)) join (R R2 join S2 using(b))
> using (a,c);
> select S1.* from (R R1 join S1 using(b)) join (R R2 join S2 using(b))
> using (a,c);
>
> But these two do not:
>
> select R2.* from (R R1 join S S1 using(b)) join (R R2 join S S2 using(b))
> using (a,c);
>
>> Error: no such table: R2
>>
> select S2.* from (R R1 join S S1 using(b)) join (R R2 join S S2 using(b))
> using (a,c);
>
>> Error: no such table: S2
>>
>
> Why are R1/S1 aliases visible at top level but R2/S2 aren't? It seems that
> if the parentheses made those joined pairs into nested queries, then R1/S1
> should be lost; if parentheses are only for grouping, then R2/S2 should be
> visible.  Either way, something seems a bit off.
>
> The official syntax diagrams [1] suggest that parentheses imply only
> grouping in this context. Then again, they also suggest it should be a
> syntax error for a table alias to follow a join source in parentheses...
> and that actually works fine, other than making the inner table aliases
> unavailable (which kind of makes sense):
>
> select J1.* from (R R1 join S S1 using(b)) J1 join (R R2 join S S2
> using(b)) J2 using (a,c);
> select J2.* from (R R1 join S S1 using(b)) J1 join (R R2 join S S2
> using(b)) J2 using (a,c);
> select R1.* from (R R1 join S S1 using(b)) J1 join (R R2 join S S2
> using(b)) J2 using (a,c);
>
>> Error: no such table: R1
>>
>
> And yes, I've since ditched the USING syntax and gone back to WHERE
> clauses, though the resulting is 20% longer and arguably much less clear
> [2]:
>
> select R1.*, R2.* from (R R1 join S using(b)) join (R R2 join S using(b))
> using(a,c) where R1.b != R2.b
> select R1.*, R2.* from R R1, R R2, S S1, S S2 where R1.b=S1.b and
> R2.b=S2.b and R1.a=R2.a and S1.c=S2.c and R1.b != R2.b
>
> [1] 
> http://www.sqlite.org/**syntaxdiagrams.html#join-**source
>
> [2] Note that I added a predicate in, which I had stripped from the
> earlier test cases for clarity
>
> Thoughts?
> Ryan
>
> __**_
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-**bin/mailman/listinfo/sqlite-**users
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] Unable to use table alias inside parenthesis on right side of join

2013-01-31 Thread Ryan Johnson

Hi all,

Strange one here... consider the following schema:

R(a,b)
S(b,c)

These queries all work fine:

select R1.* from R R1 join S S1 using(b);
select S1.* from R R1 join S S1 using(b);
select R1.* from (R R1 join S S1 using(b));
select S1.* from (R R1 join S S1 using(b));
select R1.* from (R R1 join S1 using(b)) join (R R2 join S2 using(b)) 
using (a,c);
select S1.* from (R R1 join S1 using(b)) join (R R2 join S2 using(b)) 
using (a,c);


But these two do not:

select R2.* from (R R1 join S S1 using(b)) join (R R2 join S S2 
using(b)) using (a,c);

Error: no such table: R2
select S2.* from (R R1 join S S1 using(b)) join (R R2 join S S2 
using(b)) using (a,c);

Error: no such table: S2


Why are R1/S1 aliases visible at top level but R2/S2 aren't? It seems 
that if the parentheses made those joined pairs into nested queries, 
then R1/S1 should be lost; if parentheses are only for grouping, then 
R2/S2 should be visible.  Either way, something seems a bit off.


The official syntax diagrams [1] suggest that parentheses imply only 
grouping in this context. Then again, they also suggest it should be a 
syntax error for a table alias to follow a join source in parentheses... 
and that actually works fine, other than making the inner table aliases 
unavailable (which kind of makes sense):


select J1.* from (R R1 join S S1 using(b)) J1 join (R R2 join S S2 
using(b)) J2 using (a,c);
select J2.* from (R R1 join S S1 using(b)) J1 join (R R2 join S S2 
using(b)) J2 using (a,c);
select R1.* from (R R1 join S S1 using(b)) J1 join (R R2 join S S2 
using(b)) J2 using (a,c);

Error: no such table: R1


And yes, I've since ditched the USING syntax and gone back to WHERE 
clauses, though the resulting is 20% longer and arguably much less clear 
[2]:


select R1.*, R2.* from (R R1 join S using(b)) join (R R2 join S 
using(b)) using(a,c) where R1.b != R2.b
select R1.*, R2.* from R R1, R R2, S S1, S S2 where R1.b=S1.b and 
R2.b=S2.b and R1.a=R2.a and S1.c=S2.c and R1.b != R2.b


[1] http://www.sqlite.org/syntaxdiagrams.html#join-source

[2] Note that I added a predicate in, which I had stripped from the 
earlier test cases for clarity


Thoughts?
Ryan

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] FTS SQLite "optimize" command creates very large BLOBs in *_segment table.

2013-01-31 Thread Dominique Pellé
Hi

I have a database using SQLite-3.7.14 with a FTS4 virtual table (Free
Text Search). The FTS table contains several millions of small documents.

The FTS DB is created on a server (where creating time does not matter)
and then used on an embedded device as a read-only database for FTS
queries (where speed of queries need to be as fast as possible and use
as little memory as possible).

Since the DB is used read-only on the device, I ran the the "optimize"
command (see http://www.sqlite.org/fts3.html#optimize) in the FTS index
after the last INSERT was done on the server.

Unfortunately, I found that the "optimize" command is creating very large
BLOBs in the database since it merges together all of inverted index b-trees
into very large complete b-tree. For frequent terms, merged BLOBs can be
up to ~50MB in my case in the ftsTable_segment table, as a result of
running "optimize":

  sqlite> SELECT *, length(block) AS len_blob
FROM ftsTable_segments
ORDER BY len_blob DESC LIMIT 3;

  seblockid|block|len_blob
  336808||51867353 <-- 51 MB BLOB!
  311724||19375541 <-- 19 MB BLOB
  334719||19223423 <-- 19 MB BLOB

Such large BLOBs (~50MB) are a problem in my case as they consume
a  large amount of memory on a embedded device, when doing FTS
queries with several frequent terms. SQLite memory high watermark
reported by sqlite3_memory_highwater() reaches ~200MB when query
contains several frequent terms, which is too much for an embedded
device, even though I set of soft memory limit of only 3MB using
sqlite3_soft_heap_limit64(...).

As a result, I have disabled running "optimize" on the FTS index
after creating the FTS DB on the server.

However, it got me thinking: it would be nice to be able to pass an
optional parameter to the FTS "optimize" command in order to
avoid merging b-trees when BLOBs reach a certain size?  In other
words, instead of doing...

INSERT INTO ftsTable(ftsTable) VALUES('optimize');

... I would like to be able to do something like this...

INSERT INTO ftsTable(ftsTable) VALUES('optimize=1048576');

... where optimize=1048576 indicates to *partially* optimize
reverse index b-trees in such a way that BLOBs do not exceed
1MB (1048577 bytes) in this example. It's OK if it's a fuzzy soft limit.


1/ Wouldn't such partial optimization of FTS index be useful?

2/ I also suggest that the documentation at
http://www.sqlite.org/fts3.html#optimize
indicates that optimizing an FTS index can create very large BLOBs.
This may be OK on a desktop or server, but it can be a problem
on embedded devices with limited amount of memory.

Regards
Dominique
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users