Re: [sqlite] finding pairwise similarity

2012-01-14 Thread Jean-Christophe Deschamps



where `similarity` is a measure of overlap of `t` between each pair of 
`c` and is calculated as the number of `t` common between a given pair 
divided by the total number of unique `t` in the pair.


Just like Igor, I take "unique `t` in pair" to mean "distinct `t` in pair".



I want the result to be

source_c  target_c  similarity
--------  --------  ----------
1         2         0.2 (= 1 common `t` between the 2 `c` / by total 4 `t`)


Here I take 0.2 to mean 0.25.


If so, here's some more ammunition:

select src.c_no "Source c",
       trg.c_no "Target c",
       (select count(*) from
          (select 1 from c_t where c_t.c_no in (src.c_no, trg.c_no)
           group by c_t.t_no having count(*) > 1))
       * 1.0 /
       (select count(*) from
          (select distinct t_no from c_t
           where c_t.c_no in (src.c_no, trg.c_no))) Similarity
  from c src join c trg on (src.c_no < trg.c_no);

If I'm not mistaken, both versions should give identical 
results.  Check for yourself on real data and see which is faster for 
you (most probably Igor's).


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Some pieces of the puzzle are coming together

2012-01-14 Thread John Elrick
On Fri, Jan 13, 2012 at 5:13 PM, Max Vlasov  wrote:

> John,
> I read your previous post and unfortunately, your conversation with Richard
> didn't reveal many details. I worked with a scenario similar to yours
> (Delphi + statically linked sqlite) for several years, and during this
> period an advanced admin/console form was developed that allows dynamic
> loading of different versions of sqlite, reporting vfs bandwidth, times and
> even the number of memory requests. So consider sharing more information
> about your db and queries here, or contact me directly if you wish; I hope I
> can help at least a little.
>
> As a quick guess I remember that fts was one of the places that allocated
> heavily with some queries, but I can't recall right now.
>
>
There are 88 unique queries run against the database.  They are mostly
automatically created from a class structure.  However, without dumping the
entire database in front of you, the following query should provide a good,
isolated example:

5,008 calls to

UPDATE RESPONSES SET
RESPONSE_NAME = :RESPONSE_NAME,
prelisted_value = :prelisted_value
WHERE RESPONSE_OID = :RESPONSE_OID

3.6.17: 382 ms
3.7.9: 5,924 ms

There are 80,021 rows in RESPONSES.  The CREATE statement for RESPONSES
reads:

CREATE TABLE responses (
  response_oid integer primary key autoincrement,
  response_name varchar,
  definition_parent integer,
  instance_parent integer,
  prelisted_value varchar,
  override_behavior_oid integer,
  soft_deleted_char varchar default 'F' );

CREATE INDEX response_definition_parent_idx on responses
(definition_parent);

CREATE UNIQUE INDEX response_instance_parent_idx on responses
(instance_parent, definition_parent);

CREATE UNIQUE INDEX response_instance_parent_idx2 on responses
(response_name, instance_parent, definition_parent);


Re: [sqlite] Some pieces of the puzzle are coming together

2012-01-14 Thread John Elrick
On Sat, Jan 14, 2012 at 10:42 AM, Dan Kennedy  wrote:

> On 01/14/2012 04:49 AM, John Elrick wrote:
>
>>
>> So, obviously the problem is that _malloc is being called a much larger
>> number of times by 3.7.9 in this particular circumstance -- roughly 70
>> times as often.  _memcpy demonstrates roughly 6 times as many calls.  I ran
>> a test with a much smaller dataset tracking every call to malloc back to
>> the main source line.  The results were as follows:
>>
>> 3.6.17,
>> _sqlite3_exec calls _malloc 1,101 times
>> _sqlite3_step calls _malloc 1,812 times
>>
>> 3.7.9
>>  _sqlite3_exec calls _malloc 65,227 times
>> _sqlite3_step calls _malloc 47,109 times
>>
>> At this point I've exhausted my avenues of research.  Does anyone have any
>> further ideas as to what may be going on which could cause this increase
>> in
>> memory management activity?  Or any suggestions of where to look next?
>>
>>
>>
> Is there a CREATE INDEX statement in your program? There was a change
> a few versions back that caused CREATE INDEX to make a lot more calls
> to malloc.
>
> If you build SQLite with SQLITE_OMIT_MERGE_SORT it bypasses the new
> code and builds indexes the old way - with fewer mallocs.


There are many CREATE INDEX statements which are all run at the beginning
when the database is initially created.  I can run a test with your
suggestion early next week.


Re: [sqlite] Some pieces of the puzzle are coming together

2012-01-14 Thread John Elrick
On Fri, Jan 13, 2012 at 5:58 PM, Richard Hipp  wrote:

> On Fri, Jan 13, 2012 at 4:49 PM, John Elrick wrote:
>
> >
> > 3.6.17
> >
> > Procedure      % Time    Time       Calls
> > _sqlite3_step   58.4%    1.79       5,052
> > _memcpy         22.8%    0.69   1,342,957
> > _memset          7.8%    0.24     465,299
> > ...
> > _malloc          1.9%    0.06      95,505
> >
> >
> > 3.7.9
> >
> > Procedure      % Time    Time       Calls
> > _malloc         56.9%   44.39   6,975,335
> > _sqlite3_step   30.4%   23.68       5,052
> > _memcpy          4.8%    3.70   7,710,259
> >
>
> Very curious
>
> SQLite has lots of interfaces that can be used to determine performance and
> status information.  Some example code from the command-line shell that
> accesses this status information is here:
> http://www.sqlite.org/src/artifact/aa4183d4a5243d?ln=990-1076
>
> I would very much like to see some of these same stats from your
> application.  Do you think you could add some code like that found in the
> shell and give us some more insight into what is going on?


Thanks for the direction.  I'll start working on it on Monday.


John


Re: [sqlite] Some pieces of the puzzle are coming together

2012-01-14 Thread John Elrick
On Fri, Jan 13, 2012 at 5:20 PM, Pavel Ivanov  wrote:

> > At this point I've exhausted my avenues of research.  Does anyone have
> any
> > further ideas as to what may be going on which could cause this increase
> in
> > memory management activity?  Or any suggestions of where to look next?
>
> IIRC, SQLite 3.7.9 changed its PCACHE interface and thus the page cache
> implementation. It introduced one level of indirection for each page
> in the cache. But of course I don't think that should result in such a
> huge number of additional mallocs.
> Also, the fact that you see the effect only inside Delphi and not in
> the command line utility suggests to me that Delphi installed its own
> page cache. And as it uses the old (deprecated in 3.7.9) interface, it
> may have some bad interaction with the new SQLite.
>
> And just a thought: the number of calls to memcpy with 3.7.9 is larger
> than with 3.6.17 by roughly the same amount as the number of mallocs,
> which suggests that all the new calls are actually calls to realloc().
> (Is the number of calls to free() also bigger, by about 6M?)


The number of calls to _free are 6,968,427 in 3.7.9.


[sqlite] sqlite3_set_auxdata & invalid pointer

2012-01-14 Thread gwenn
Hello,
I am trying to add custom aggregate function support in a golang driver
(scalar functions are OK). While testing, I got this:
*** glibc detected *** ./6.out: realloc(): invalid pointer:
0x02daa1c5 ***
=== Backtrace: =
/lib/x86_64-linux-gnu/libc.so.6(+0x72656)[0x2b9a7b5da656]
/lib/x86_64-linux-gnu/libc.so.6(realloc+0x312)[0x2b9a7b5e0762]
/home/gwen/Test/sqlite-autoconf-3070900/.libs/libsqlite3.so.0(+0x30387)[0x2b9a7b2ec387]
/home/gwen/Test/sqlite-autoconf-3070900/.libs/libsqlite3.so.0(+0x16a5b)[0x2b9a7b2d2a5b]
/home/gwen/Test/sqlite-autoconf-3070900/.libs/libsqlite3.so.0(+0x234da)[0x2b9a7b2df4da]
/home/gwen/Test/sqlite-autoconf-3070900/.libs/libsqlite3.so.0(sqlite3_set_auxdata+0xb6)[0x2b9a7b2e2d86]

I have just enough skill to debug with gdb and find this line:
62056: pVdbeFunc = sqlite3DbRealloc(pCtx->s.db, pVdbeFunc, nMalloc);

Could you please help me find what I am doing wrong?
I just call sqlite3_set_auxdata in my xStep function.
Thanks.


Re: [sqlite] finding pairwise similarity

2012-01-14 Thread Igor Tandetnik
Mr. Puneet Kishor  wrote:
> I have two tables like so
> 
>CREATE TABLE c (c_no INTEGER PRIMARY KEY, c_name TEXT);
>CREATE TABLE t (t_no INTEGER PRIMARY KEY, t_name TEXT);
>CREATE TABLE c_t (c_no INTEGER, t_no INTEGER);
> 
> Every row in `c` has one or more `t`. I want, in a single SQL, the following
> 
>source_c, target_c, similarity
> 
> where `similarity` is a measure of overlap of `t` between each pair of `c` 
> and is calculated as the number of `t` common between
> a given pair divided by the total number of unique `t` in the pair.

select src.c_no source_c, trg.c_no target_c,
(select count(*) from (
 select t_no from c_t where c_t.c_no = src.c_no
 intersect
 select t_no from c_t where c_t.c_no = trg.c_no)) * 1.0
/
(select count(*) from (
 select t_no from c_t where c_t.c_no = src.c_no
 union
 select t_no from c_t where c_t.c_no = trg.c_no)) similarity
from c src join c trg on (src.c_no < trg.c_no);

-- 
Igor Tandetnik



Re: [sqlite] SSD with TRIM

2012-01-14 Thread Max Vlasov
On Sat, Jan 14, 2012 at 11:52 PM, Simon Slavin  wrote:

>
> On 14 Jan 2012, at 7:24pm, Max Vlasov wrote:
>
> > On Sat, Jan 14, 2012 at 6:12 AM, Simon Slavin wrote:
> >
> >> Fast.  Fasty fast.  Speed is high.  INSERT, UPDATE, and DELETE all
> >> significantly faster.  SELECT is a bit faster, but there's less
> >> difference.
> >
> > Simon, very interesting. Can you make some tests related to internal
> > fragmentation?
> >
> > INSERT INTO TestTable (id, text) VALUEs (abs(random() % (1 << 48)),
> > '12345678901234567...') {the text is about 512 bytes long}
>
> If you want to supply the CREATE command and tell me that the '...' in the
> above means that that line should continue to 512 digits, I'm happy to run
> the above.  So the speeds are not directly comparable.
>
>
Ok

CREATE TABLE [TestTable] (
[Id] INTEGER PRIMARY KEY AUTOINCREMENT,
[Text] TEXT
)

The string just continues to 512 bytes in length; I think it can be even
bigger. The idea is for the primary key to jump around heavily inside the
b-tree. As I understand it, the test should benefit from the famously
near-instant access time of an SSD, but on an Asus T91MT with an SSD I
still noticed degradation about 50% of the way through the job, when the
CPU dropped from 80% to 50% (so it wasn't sqlite doing the waiting). So
either Windows 7 on a two-year-old SSD system isn't optimized enough, or
it is something else.

Max


Re: [sqlite] SSD with TRIM

2012-01-14 Thread Simon Slavin

On 14 Jan 2012, at 7:24pm, Max Vlasov wrote:

> On Sat, Jan 14, 2012 at 6:12 AM, Simon Slavin  wrote:
> 
>> Fast.  Fasty fast.  Speed is high.  INSERT, UPDATE, and DELETE all
>> significantly faster.  SELECT is a bit faster, but there's less difference.
> 
> Simon, very interesting. Can you make some tests related to internal
> fragmentation? As an advanced user of sqlite, you can probably easily
> invent your own tests :), but if not, here is a test I used at least once:
> 
> INSERT INTO TestTable (id, text) VALUEs (abs(random() % (1 << 48)),
> '12345678901234567...') {the text is about 512 bytes long}

If you want to supply the CREATE command and tell me that the '...' in the 
above means that that line should continue to 512 digits, I'm happy to run the 
above.  So the speeds are not directly comparable.

I was running a real webserver app (Apache/PHP/SQLite).  But I was debugging it 
on a new laptop -- a MacBook Pro 2.4GHz Core i7 with SSD -- and noticed an 
improvement in speed over when I debugged it with an older laptop with a hard 
disk.  But the older one has two cores instead of four and the motherboard 
architecture may not be as good.  And the app does quite a few different 
things, and there's a possible ethernet bandwidth bottleneck, so it's not 
suitable as a benchtest.

Simon.


Re: [sqlite] SSD with TRIM

2012-01-14 Thread Max Vlasov
On Sat, Jan 14, 2012 at 6:12 AM, Simon Slavin  wrote:

>
> Fast.  Fasty fast.  Speed is high.  INSERT, UPDATE, and DELETE all
> significantly faster.  SELECT is a bit faster, but there's less difference.
>
>
Simon, very interesting. Can you make some tests related to internal
fragmentation? As an advanced user of sqlite, you can probably easily
invent your own tests :), but if not, here is a test I used at least once:

INSERT INTO TestTable (id, text) VALUEs (abs(random() % (1 << 48)),
'12345678901234567...') {the text is about 512 bytes long}

1,000,000 inserts should create a 1 GB database, and the time with a
regular hard disk was above 15 minutes, maybe even half an hour. So I'm
very interested to see whether there has been any progress with overall
ssd performance.

Thanks,

Max


[sqlite] finding pairwise similarity

2012-01-14 Thread Mr. Puneet Kishor
I have two tables like so

CREATE TABLE c (c_no INTEGER PRIMARY KEY, c_name TEXT);
CREATE TABLE t (t_no INTEGER PRIMARY KEY, t_name TEXT);
CREATE TABLE c_t (c_no INTEGER, t_no INTEGER);

Every row in `c` has one or more `t`. I want, in a single SQL, the following

source_c, target_c, similarity

where `similarity` is a measure of overlap of `t` between each pair of `c` and 
is calculated as the number of `t` common between a given pair divided by the 
total number of unique `t` in the pair. For example, given the following records

c   t
--- ---
1   1
1   2
1   3
2   1
2   5

I want the result to be

source_c  target_c  similarity
--------  --------  ----------
1         2         0.2 (= 1 common `t` between the 2 `c` / by total 4 `t`)

Would appreciate any nudge toward a solution.

--
Puneet Kishor


Re: [sqlite] Some pieces of the puzzle are coming together

2012-01-14 Thread Dan Kennedy

On 01/14/2012 04:49 AM, John Elrick wrote:


So, obviously the problem is that _malloc is being called a much larger
number of times by 3.7.9 in this particular circumstance -- roughly 70
times as often.  _memcpy demonstrates roughly 6 times as many calls.  I ran
a test with a much smaller dataset tracking every call to malloc back to
the main source line.  The results were as follows:

3.6.17,
_sqlite3_exec calls _malloc 1,101 times
_sqlite3_step calls _malloc 1,812 times

3.7.9
_sqlite3_exec calls _malloc 65,227 times
_sqlite3_step calls _malloc 47,109 times

At this point I've exhausted my avenues of research.  Does anyone have any
further ideas as to what may be going on which could cause this increase in
memory management activity?  Or any suggestions of where to look next?




Is there a CREATE INDEX statement in your program? There was a change
a few versions back that caused CREATE INDEX to make a lot more calls
to malloc.

If you build SQLite with SQLITE_OMIT_MERGE_SORT it bypasses the new
code and builds indexes the old way - with fewer mallocs.

Dan.


Re: [sqlite] Slow commits

2012-01-14 Thread Dimiter 'malkia' Stanev



Hi Normand,

Just for sanity, check that Windows is not creating Previous Versions of 
the file for you.


Right click on the file in Explorer -> Properties -> Previous Versions.

Also check whether Caching is enabled on your HDD:
Control Panel -> Device Manager -> Disk Drives -> (Click on your HDD)

Then go to the "Policies" tab (should be second). There are two settings:

"Enable write caching on disk" and
"Turn off Windows write-cache buffer flushing on device"

The first one should probably be checked; the second one, if checked, 
is faster but more dangerous (at least for me).


It depends on how safe your data needs to be. Our data can be 
regenerated or taken back from P4, and at worst I would lose some hours 
of work (programming).


It's also a good idea to check with SysInternals ProcMon what exactly is 
going on.


Another good tool is SysInternals RamMap: with it you can check what 
portions of the file are in the file cache (meta-data). You can also 
flush the cache with it and test various scenarios.


Cheers,
Dimiter 'malkia' Stanev,
Senior Software Engineer,
Treyarch, Activision.

On 1/13/2012 3:56 PM, Normand Mongeau wrote:



On 2012-01-13 18:10, Roger Binns wrote:


On 13/01/12 14:35, Normand Mongeau wrote:

It gets worse. On a clean empty database, the same 534 transactions
take 140 seconds. That's a not very impressive rate of 3.8 inserts
per second. The FAQ says that SQLite should be able to do a "few dozen
transactions per second". I'd be happy to see that.

What file extension are you using for the database? There is a long list
of extensions that System Restore monitors, and makes backups of the
files
as they change. This will kill your performance.


I was using .db as an extension, and changed it to something ludicrous,
but it didn't make a difference. Good idea though.

Thanks,

Normand




Extension list and terse details are at:

http://msdn.microsoft.com/en-us/library/Aa378870

Roger





[sqlite] Bug: "Invalid statement in fillWindow()" from Explain query plan using subquery

2012-01-14 Thread Luke Stevens
Using SQLite 3.7.4 on Android. To reproduce:

create table S (id string, name string);
create table F (id string, sid string);

explain query plan select (select count(*) from F where F.sid=S.id) as
FCount, S.name from S;

Result: an empty cursor, with the log showing:

SqliteCursor.cpp
Invalid statement in fillWindow()

Just running the select, without explain query plan, seems to be OK.

Could not find anything like this among existing bugs.

Luke

 



Re: [sqlite] Using non-aggregate columns in group by (with an additional issue)

2012-01-14 Thread David Garfield
A very minor change to your query is all that is needed.

select max(x.timestamp),x.value,x.person from table as x,
(select person,max(value) as maxvalue from table group by person order by
maxvalue desc) as y
where x.person = y.person
and x.value = y.maxvalue
group by x.person, x.value


I don't know how this compares in terms of performance with Igor's solution.

--David Garfield

On Fri, Jan 13, 2012 at 06:07, Dilip Ranganathan wrote:

> I have a table that looks like something like this:
>
>timestampvalue   person
>===
>2010-01-12 00:00:00   33  emp1
>2010-01-12 11:00:00   22  emp1
>2010-01-12 09:00:00   16  emp2
>2010-01-12 08:00:00   16  emp2
>2010-01-12 12:12:00   45  emp3
>2010-01-12 13:44:00   64  emp4
>2010-01-12 06:00:00   33  emp1
>2010-01-12 15:00:00   12  emp5
>
> I wanted to find the maximum value associated with each person. The obvious
> query was:
>
> select person,max(value) from table group by person
>
> Now I wanted to include the timestamp associated with each max(value). I
> could not use timestamp column in the above query because as everyone
> knows, it won't appear in the group by clause. So I wrote this instead:
>
> select x.timestamp,x.value,x.person from table as x,
> (select person,max(value) as maxvalue from table group by person order by
> maxvalue desc) as y
> where x.person = y.person
> and x.value = y.maxvalue
>
> This works -- to an extent. I now see:
>
>timestampvalue   person
>===
>2010-01-12 13:44:00   64  emp4
>2010-01-12 12:12:00   45  emp3
>2010-01-12 06:00:00   33  emp1
>2010-01-12 00:00:00   33  emp1
>2010-01-12 08:00:00   16  emp2
>2010-01-12 09:00:00   16  emp2
>2010-01-12 15:00:00   12  emp5
>
> The problem is now I get all the entries for emp1 and emp2 that ends up
> with the same max(value).
>
> Suppose among emp1 and emp2, I only want to see the entry with the latest
> timestamp. IOW, I want this:
>
>timestampvalue   person
>===
>2010-01-12 13:44:00   64  emp4
>2010-01-12 12:12:00   45  emp3
>2010-01-12 06:00:00   33  emp1
>2010-01-12 09:00:00   16  emp2
>2010-01-12 15:00:00   12  emp5
>
> What kind of query would I have to write? Is it possible to extend the
> nested query I wrote to achieve what I want or does one have to rewrite
> everything from the scratch?
>
> If its important, timestamps are actually stored as julian days. I use the
> datetime() function to convert them back to a string representation in
> every query.