Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-22 Thread Jonathan Moules

On 2018-03-22 12:03, Richard Hipp wrote:

On 3/21/18, Jonathan Moules  wrote:

I've spent the last ~90 minutes trying to build this but to no avail

The query planner enhancements are now available in the pre-release
snapshot on the https://sqlite.org/download.html page.


Well, after a lot of fiddling, I finally got something that seems to work.

One observation that's relevant to you: the suggested

./configure; make sqlite3.c

results in a make error (I'm running it in a Linux Mint VM):

config.status: executing libtool commands
make: Warning: File 'Makefile' has modification time 2.4 s in the future
make: Nothing to be done for 'sqlite3.c'.
make: warning: Clock skew detected. Your build may be incomplete.

No idea what it's on about - I have no clock issues on the host machine 
and the time appears to be correct in the VM. Got around it in the end 
by running it as two commands.
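i.e. instead of the single chained command, something like:

./configure
make sqlite3.c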


--

Back to the issue at hand, using my modestly populated test database: 
all six variations of the query (ORDER BY ASC; ORDER BY DESC; no ORDER 
BY or LIMIT - each of those twice, once with JOINs and once with LEFT 
JOINs) take the same amount of time in the 3.23.0 build - 0.33 seconds. 
(I used the built-in ".timer on".)
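
For reference, that just means switching timing on in the sqlite3 shell and then running each variant in turn, roughly:

sqlite> .timer on
sqlite> SELECT u.url_id, u.url, l.error_code FROM urls u ... ;   -- each of the six variants, one after another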


That matches the 3.15 speed for three of the queries and beats one of 
them, BUT it is still several times slower than 3.15 for the remaining 
two queries on the same database (albeit running on Windows there).


The query below and the no ORDER BY / LIMIT variant both take 
just 0.08s to run in 3.15.


SELECT
    u.url_id, u.url, l.error_code
FROM
    urls u
LEFT JOIN
    lookups l
    USING(url_id)
LEFT JOIN
    (select * from v_most_recent_lookup_per_url where url_id in (
        select url_id from urls where url = 
'http://catalogue.beta.data.wa.gov.au/api/3/action/resource_search?limit=100=0=format%3AWFS'
    )) recent
    -- This is the important bit
    -- Here we use the most recent lookup url_id to link to the source_seed_id, so we only find its children
    -- Or alternatively itself
    ON u.source_seed_id = recent.url_id
    OR u.url_id = recent.url_id
WHERE
    -- For JSON-spider at least, Generic's are guaranteed to be generic pages.
    l.is_generic_flag = 1
    AND
    -- Must be "or equal to" so we can get the lookup of the very base url.
    l.retrieval_datetime >= recent.retrieval_datetime
    AND
    DATETIME(recent.retrieval_datetime) > DATETIME('now', '-14 days', 'start of day')
ORDER BY
    u.url_id ASC
LIMIT 1;


Although the two "fast" queries are much slower if a non-existent url is 
used (rising to about 0.5s in 3.15 for the same query).


Cheers,
Jonathan

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-22 Thread Richard Hipp
On 3/21/18, Jonathan Moules  wrote:
> I've spent the last ~90 minutes trying to build this but to no avail

The query planner enhancements are now available in the pre-release
snapshot on the https://sqlite.org/download.html page.
-- 
D. Richard Hipp
d...@sqlite.org
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-21 Thread Simon Slavin
On 21 Mar 2018, at 10:58pm, Jonathan Moules  
wrote:

> I know SQLite has a CLI tool, but I'm happier with front-ends

You can use the CLI tool (which does still work under Windows 7) to open the 
database you prepared in your preferred environment and execute just the 
statement you're interested in.  You don't have to do the whole task in the CLI.

The advantage of using the CLI is that it doesn't involve any conversions to or 
from the SQLite API.  The SQLite development team wrote the whole thing and can 
completely understand the behaviour of everything it does.

You should work with a copy of the database, not the original, of course.
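
For example (copy the database file first; the names here are only illustrative):

sqlite3 copy-of-mydatabase.db
sqlite> .timer on
sqlite> SELECT ... ;   -- just the statement you're interested in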

Simon.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-21 Thread Jonathan Moules

Hi Richard,
I've spent the last ~90 minutes trying to build this but to no avail I'm 
afraid. I'm unable to find a version of nmake for Windows 7 (It seems to 
be a VS thing and that in turn is Windows 8/10 only). Then I tried 
inside a VM of Mint, managed to build it, and whilst I was trying to 
find some sort of sqlite front-end (I know SQLite has a CLI tool, but 
I'm happier with front-ends), the VM crashed (for the second time)!


So I'm afraid self-helping by trying that branch isn't happening. If you 
have the means to readily build one and send it to me off-list, feel 
free to and I'll happily test it. If you can't trust the creator of 
SQLite who can you trust? :-)


---

The good news is that Keith's suggestion of removing the "LEFT" from the 
JOIN fixed the ORDER BY DESC issue. The database and query are the same 
as the ones I provided you, just with a bunch of semi-random data in 
the "urls" table.


---

However - I've now inserted 100,000 semi-random entries into the 
"lookups" table. If I run the same query again (which is unchanged 
except for removing the LEFTs from in front of the JOINs), it's going 
slow again, taking about 0.5s.
If I change the ORDER BY to ASC, it's also about 0.5s - so they're 
consistent.


That's with the index below added, which seems to be the planner's 
preference out of the numerous variants I created:


CREATE INDEX url_id_datetime_idx ON lookups (
    url_id DESC,
    retrieval_datetime
);
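
For anyone wanting to double-check which index the planner settles on, EXPLAIN QUERY PLAN on the statement shows it - with the index above in use the output should contain a line roughly of the form:

EXPLAIN QUERY PLAN SELECT ... ;   -- the full query
-- expect something like: SEARCH TABLE lookups AS l USING INDEX url_id_datetime_idx (...)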


The things you're likely interested in though, and they may or may not 
be addressed by your branch:

a) If I remove the ORDER BY and LIMIT, the query takes 15 (fifteen) seconds!

b) And if I add the LEFT back in front of the JOINs, the ORDER BY ASC 
query is back to being modestly speedy - 0.07s - and with no ORDER BY 
it's the same as well.


c) But with the LEFT JOINs the query takes about 1.1s for ORDER BY DESC.

I can provide another copy of the database with the new data in it if 
you wish. Or I can test the fix if you have a dll you want to send me 
off-list.

Thanks,
Jonathan





On 2018-03-21 17:58, Richard Hipp wrote:

On 3/21/18, Jonathan Moules  wrote:

So, I'm back to being stuck on this.
I have inserted 500,000 random urls (no extra lookups - still just
1000), and now the query (as per the below reply) is back to being
somewhat slow (I'm using sqlite 3.15) at about 0.6s per request.

Do you have the ability to compile SQLite from canonical sources?  If
so, please try again with the tip of the join-strength-reduction
branch (https://www.sqlite.org/src/timeline?r=join-strength-reduction).

To compile on Windows:

(1) Download a tarball or ZIP archive (or SQLite Archive) and unpack it.
(2) Run "nmake /f makefile.msc sqlite3.c"

On unix:

(1) Download and unpack as before
(2) ./configure; make sqlite3.c

The only dependence for the above is having a "tclsh" somewhere on your $PATH.




___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-21 Thread Keith Medcalf

and Richard's patch merely attempts to detect such errors of query phrasing and 
convert the join type for you ... so that there is no need to generate the 
possibly millions of unnecessary intermediate results ...
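
To make that concrete, a minimal sketch using the tables from this thread: because the WHERE clause throws away every row in which the lookups columns are NULL, the two statements below return identical rows, and the second never has to generate the unmatched (NULL-extended) rows in the first place:

-- outer join whose NULL-extended rows are then discarded by the WHERE clause
SELECT u.url, l.error_code
  FROM urls u
  LEFT JOIN lookups l USING (url_id)
 WHERE l.is_generic_flag = 1;

-- equivalent plain (inner) join
SELECT u.url, l.error_code
  FROM urls u
  JOIN lookups l USING (url_id)
 WHERE l.is_generic_flag = 1;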

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.


>-Original Message-
>From: sqlite-users [mailto:sqlite-users-
>boun...@mailinglists.sqlite.org] On Behalf Of Keith Medcalf
>Sent: Wednesday, 21 March, 2018 12:31
>To: SQLite mailing list
>Subject: Re: [sqlite] How to optimise a somewhat-recursive query?
>
>
>Or just try it with the superfluous outer join keyword (LEFT) removed
>since you are really just doing an inner (equi) join and the outer
>join data is just discarded (by your WHERE clause constraints) after
>it is generated anyway ...
>
>---
>The fact that there's a Highway to Hell but only a Stairway to Heaven
>says a lot about anticipated traffic volume.
>
>
>>-Original Message-
>>From: sqlite-users [mailto:sqlite-users-
>>boun...@mailinglists.sqlite.org] On Behalf Of Jonathan Moules
>>Sent: Wednesday, 21 March, 2018 12:17
>>To: sqlite-users@mailinglists.sqlite.org
>>Subject: Re: [sqlite] How to optimise a somewhat-recursive query?
>>
>>Hi Richard, Simon
>>Re: Compiling - I'm afraid I wouldn't really know where to begin. A
>>quick google finds
>>https://superuser.com/questions/146577/where-do-i-find-nmake-for-
>>windows-7-x64
>>- but the answers/links there don't seem to work. I've got to go-out
>>now
>>but can take another look later and see if I can find a copy
>>(Microsoft
>>(I'm on Windows) never make it easy to find stuff).
>>
>>Simon - I suspected the ORDER BY thing was wrong but wanted to check
>>first rather than simply come out with "SQLite is broken!". This may
>>be
>>related to the 3.22 regression I brought up a couple of days ago
>(and
>>why I'm using 3.15) - probably why Dr H is suggesting I try his
>>branch.
>>I'm executing the query using SQLiteStudio (Or Python).
>>Thanks,
>>Jonathan
>>
>>
>>On 2018-03-21 17:58, Richard Hipp wrote:
>>> On 3/21/18, Jonathan Moules <jonathan-li...@lightpear.com> wrote:
>>>> So, I'm back to being stuck on this.
>>>> I have inserted 500,000 random urls (no extra lookups - still
>just
>>>> 1000), and now the query (as per the below reply) is back to
>being
>>>> somewhat slow (I'm using sqlite 3.15) at about 0.6s per request.
>>> Do you have the ability to compile SQLite from canonical sources?
>>If
>>> so, please try again with the tip of the join-strength-reduction
>>> branch (https://www.sqlite.org/src/timeline?r=join-strength-
>>reduction).
>>>
>>> To compile on Windows:
>>>
>>> (1) Download a tarball or ZIP archive (or SQLite Archive) and
>>unpack it.
>>> (2) Run "nmake /f makefile.msc sqlite3.c"
>>>
>>> On unix:
>>>
>>> (1) Download and unpack as before
>>> (2) ./configure; make sqlite3.c
>>>
>>> The only dependence for the above is having a "tclsh" somewhere on
>>your $PATH.
>>>
>>
>>
>>___
>>sqlite-users mailing list
>>sqlite-users@mailinglists.sqlite.org
>>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
>
>
>___
>sqlite-users mailing list
>sqlite-users@mailinglists.sqlite.org
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-21 Thread Keith Medcalf

Or just try it with the superfluous outer join keyword (LEFT) removed since you 
are really just doing an inner (equi) join and the outer join data is just 
discarded (by your WHERE clause constraints) after it is generated anyway ...

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.


>-Original Message-
>From: sqlite-users [mailto:sqlite-users-
>boun...@mailinglists.sqlite.org] On Behalf Of Jonathan Moules
>Sent: Wednesday, 21 March, 2018 12:17
>To: sqlite-users@mailinglists.sqlite.org
>Subject: Re: [sqlite] How to optimise a somewhat-recursive query?
>
>Hi Richard, Simon
>Re: Compiling - I'm afraid I wouldn't really know where to begin. A
>quick google finds
>https://superuser.com/questions/146577/where-do-i-find-nmake-for-
>windows-7-x64
>- but the answers/links there don't seem to work. I've got to go-out
>now
>but can take another look later and see if I can find a copy
>(Microsoft
>(I'm on Windows) never make it easy to find stuff).
>
>Simon - I suspected the ORDER BY thing was wrong but wanted to check
>first rather than simply come out with "SQLite is broken!". This may
>be
>related to the 3.22 regression I brought up a couple of days ago (and
>why I'm using 3.15) - probably why Dr H is suggesting I try his
>branch.
>I'm executing the query using SQLiteStudio (Or Python).
>Thanks,
>Jonathan
>
>
>On 2018-03-21 17:58, Richard Hipp wrote:
>> On 3/21/18, Jonathan Moules <jonathan-li...@lightpear.com> wrote:
>>> So, I'm back to being stuck on this.
>>> I have inserted 500,000 random urls (no extra lookups - still just
>>> 1000), and now the query (as per the below reply) is back to being
>>> somewhat slow (I'm using sqlite 3.15) at about 0.6s per request.
>> Do you have the ability to compile SQLite from canonical sources?
>If
>> so, please try again with the tip of the join-strength-reduction
>> branch (https://www.sqlite.org/src/timeline?r=join-strength-
>reduction).
>>
>> To compile on Windows:
>>
>> (1) Download a tarball or ZIP archive (or SQLite Archive) and
>unpack it.
>> (2) Run "nmake /f makefile.msc sqlite3.c"
>>
>> On unix:
>>
>> (1) Download and unpack as before
>> (2) ./configure; make sqlite3.c
>>
>> The only dependence for the above is having a "tclsh" somewhere on
>your $PATH.
>>
>
>
>___
>sqlite-users mailing list
>sqlite-users@mailinglists.sqlite.org
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-21 Thread Simon Slavin
On 21 Mar 2018, at 6:17pm, Jonathan Moules  wrote:

> Simon - I suspected the ORDER BY thing was wrong but wanted to check first 
> rather than simply come out with "SQLite is broken!". This may be related to 
> the 3.22 regression I brought up a couple of days ago (and why I'm using 
> 3.15) - probably why Dr H is suggesting I try his branch.
> I'm executing the query using SQLiteStudio (Or Python).

It is not supposed to be possible to speed up a query by adding an ORDER BY 
clause.  The very worst that SQLite is meant to do is pick a fast order for the 
results.  If the only difference between the two queries really is an ORDER BY 
clause, it looks like you've found something unexpected in SQLite.

I don't know what operating system you're using, but I hope Dr H will pilot you 
through providing the developers with a way to reproduce this problem.

You can provide some useful information by using the precompiled binary of the 
sqlite shell tool to give the output from putting "EXPLAIN QUERY PLAN" before 
your SELECT commands:

EXPLAIN QUERY PLAN SELECT
   u.url, l.error_code 

The difference between the two might provide some clues about what's going on.
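
For example, running it for both the ASC and DESC variants and comparing the plan rows (which look roughly like the lines below) would show whether one of them has fallen back to a full table scan:

EXPLAIN QUERY PLAN SELECT u.url, l.error_code FROM urls u ... ORDER BY u.url_id DESC LIMIT 1;
-- 0|0|0|SCAN TABLE urls AS u
-- 0|1|1|SEARCH TABLE lookups AS l USING ... INDEX ...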

Simon.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-21 Thread David Raymond
To at least try increasing your version from 3.15 to 3.22 you can get 
pre-compiled binaries from the http://www.sqlite.org/download.html page. Go to 
the Precompiled Binaries for Windows section: there's a sqlite-tools package which 
includes the Command Line Interface, and .dlls which you can put in your 
...\PythonXX\DLLs (replacing the sqlite3.dll which is in there at the moment).

Then give it a whirl with either the CLI or the new dll in Python.

In addition you can run "explain query plan blah" or the more detailed "explain 
blah" to get a view of what it's doing when it executes the statement "blah".


-Original Message-
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On 
Behalf Of Jonathan Moules
Sent: Wednesday, March 21, 2018 2:17 PM
To: sqlite-users@mailinglists.sqlite.org
Subject: Re: [sqlite] How to optimise a somewhat-recursive query?

Hi Richard, Simon
Re: Compiling - I'm afraid I wouldn't really know where to begin. A 
quick google finds 
https://superuser.com/questions/146577/where-do-i-find-nmake-for-windows-7-x64 
- but the answers/links there don't seem to work. I've got to go-out now 
but can take another look later and see if I can find a copy (Microsoft 
(I'm on Windows) never make it easy to find stuff).

Simon - I suspected the ORDER BY thing was wrong but wanted to check 
first rather than simply come out with "SQLite is broken!". This may be 
related to the 3.22 regression I brought up a couple of days ago (and 
why I'm using 3.15) - probably why Dr H is suggesting I try his branch.
I'm executing the query using SQLiteStudio (Or Python).
Thanks,
Jonathan


On 2018-03-21 17:58, Richard Hipp wrote:
> On 3/21/18, Jonathan Moules <jonathan-li...@lightpear.com> wrote:
>> So, I'm back to being stuck on this.
>> I have inserted 500,000 random urls (no extra lookups - still just
>> 1000), and now the query (as per the below reply) is back to being
>> somewhat slow (I'm using sqlite 3.15) at about 0.6s per request.
> Do you have the ability to compile SQLite from canonical sources?  If
> so, please try again with the tip of the join-strength-reduction
> branch (https://www.sqlite.org/src/timeline?r=join-strength-reduction).
>
> To compile on Windows:
>
> (1) Download a tarball or ZIP archive (or SQLite Archive) and unpack it.
> (2) Run "nmake /f makefile.msc sqlite3.c"
>
> On unix:
>
> (1) Download and unpack as before
> (2) ./configure; make sqlite3.c
>
> The only dependence for the above is having a "tclsh" somewhere on your $PATH.
>


___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-21 Thread Jonathan Moules

Hi Richard, Simon
Re: Compiling - I'm afraid I wouldn't really know where to begin. A 
quick google finds 
https://superuser.com/questions/146577/where-do-i-find-nmake-for-windows-7-x64 
- but the answers/links there don't seem to work. I've got to go-out now 
but can take another look later and see if I can find a copy (Microsoft 
(I'm on Windows) never make it easy to find stuff).


Simon - I suspected the ORDER BY thing was wrong but wanted to check 
first rather than simply come out with "SQLite is broken!". This may be 
related to the 3.22 regression I brought up a couple of days ago (and 
why I'm using 3.15) - probably why Dr H is suggesting I try his branch.

I'm executing the query using SQLiteStudio (Or Python).
Thanks,
Jonathan


On 2018-03-21 17:58, Richard Hipp wrote:

On 3/21/18, Jonathan Moules  wrote:

So, I'm back to being stuck on this.
I have inserted 500,000 random urls (no extra lookups - still just
1000), and now the query (as per the below reply) is back to being
somewhat slow (I'm using sqlite 3.15) at about 0.6s per request.

Do you have the ability to compile SQLite from canonical sources?  If
so, please try again with the tip of the join-strength-reduction
branch (https://www.sqlite.org/src/timeline?r=join-strength-reduction).

To compile on Windows:

(1) Download a tarball or ZIP archive (or SQLite Archive) and unpack it.
(2) Run "nmake /f makefile.msc sqlite3.c"

On unix:

(1) Download and unpack as before
(2) ./configure; make sqlite3.c

The only dependence for the above is having a "tclsh" somewhere on your $PATH.




___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-21 Thread Eduardo
On Wed, 21 Mar 2018 17:39:45 +
Jonathan Moules  wrote:

> So, I'm back to being stuck on this.
> I have inserted 500,000 random urls (no extra lookups - still just 
> 1000), and now the query (as per the below reply) is back to being 
> somewhat slow (I'm using sqlite 3.15) at about 0.6s per request.

After the huge insert, did you run ANALYZE?
 
> After a couple of hours of investigation, it's only slow when there is 
> either no ORDER BY, or if I use DESC (which is what I need). If I use 
> ORDER BY u.url ASC - it's near instantaneous.
> 
> I've tried every possible combination of indexes I can think up, 
> including of course with url_id DESC. I've also removed the ORDER BY in 
> the view (and set it to DESC as well), but that made absolutely no 
> difference.

Do your indexes have the primary key or rowid as their last column?
 
> I'm a little confused as to why I'm seeing this behaviour - my limited 
> understanding of the query-planner and reading 
> https://sqlite.org/queryplanner.html - suggests that at least when using 
> indexes, it'll simply scan an index backwards if that looks like it'll 
> help. I appreciate this is a result-set not an index, but in this case 
> could not the query planner realise that reading through the results 
> backwards would be faster than whatever it's doing?
> 
> And for that matter, shouldn't no ORDER BY be at least the same speed as 
> ORDER BY u.url_id ASC?

What does an integrity check return?

Try a REINDEX and ANALYZE; all cases should then be similarly fast (or slow).
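
That is, something like:

PRAGMA integrity_check;   -- should return "ok"
REINDEX;
ANALYZE;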

> Thoughts welcome; Thanks!
> Jonathan
> 
> 
> On 2018-03-19 00:24, Jonathan Moules wrote:
> > Thanks Simon and Quan.
> > I'm not sure it's the view itself per-se - It takes 0.000s (time too 
> > small to measure) for just the full View to be run on this dataset.
> >
> > It turns out the problem is simpler than that and no data changes are 
> > needed. I did consider Quan Yong Zhai's option and gave it a try, but 
> > couldn't fathom out the necessary SQL to get what I wanted (it's 
> > getting late).
> >
> > Instead prompted by the replies here, I've changed the query very 
> > slightly to the below which solves the problem:
> >
> > SELECT
> > u.url, l.error_code
> > FROM
> > urls u
> > LEFT JOIN
> > lookups l
> > USING(url_id)
> > LEFT JOIN
> > (select * from v_most_recent_lookup_per_url where url_id in (
> > select url_id from urls where url = 'example.com'
> > )) recent
> > -- This is the important bit
> > -- Here we use the most recent lookup url_id to link to the source_seed_id, so we only find its children
> > -- Or alternatively itself
> > ON u.source_seed_id = recent.url_id
> > OR u.url_id = recent.url_id
> > WHERE
> > -- For JSON-spider at least, Generic's are guaranteed to be generic pages.
> > l.is_generic_flag = 1
> > AND
> > -- Must be "or equal to" so we can get the lookup of the very base url.
> > l.retrieval_datetime >= recent.retrieval_datetime
> > AND
> > DATETIME(recent.retrieval_datetime) > DATETIME('now', '-5 days')
> > ORDER BY
> > u.url_id DESC
> > LIMIT 1;
> >
> >
> > To save readers having to compare manually, the difference is this: I 
> > turned the "recent" alias item from the View into a subquery (still 
> > using the view), and then moved the "where url = example.com" part in 
> > to there.
> > The query is now literally two orders of magnitude faster, from 0.2s 
> > to 0.004s. No new indexes or anything, just that change.
> >
> > Hopefully this will scale to full datasets; if it doesn't I may have 
> > to consider the other suggestions, but for now this is a 
> > minimum-effort solution.
> >
> > I'm not actually sure what SQLite was doing in the previous query to
> > make it take so long, so I imagine there was some hideous recursing
> > going on or something.
> >
> > Scope for optimisation?
> >
> > Thanks again,
> > Jonathan
> 
> 
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


-- 
Eduardo 
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-21 Thread Simon Slavin
On 21 Mar 2018, at 5:39pm, Jonathan Moules  wrote:

> After a couple of hours of investigation, it's only slow when there is either 
> no ORDER BY, or if I use DESC (which is what I need). If I use ORDER BY u.url 
> ASC - it's near instantaneous.

You have a query which gets faster when you add an ORDER BY clause ?  That's 
not meant to happen.

Are you using an up-to-date version of SQLite ?

Are you executing the query using the sqlite shell tool ?

Simon.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-21 Thread Richard Hipp
On 3/21/18, Jonathan Moules  wrote:
> So, I'm back to being stuck on this.
> I have inserted 500,000 random urls (no extra lookups - still just
> 1000), and now the query (as per the below reply) is back to being
> somewhat slow (I'm using sqlite 3.15) at about 0.6s per request.

Do you have the ability to compile SQLite from canonical sources?  If
so, please try again with the tip of the join-strength-reduction
branch (https://www.sqlite.org/src/timeline?r=join-strength-reduction).

To compile on Windows:

(1) Download a tarball or ZIP archive (or SQLite Archive) and unpack it.
(2) Run "nmake /f makefile.msc sqlite3.c"

On unix:

(1) Download and unpack as before
(2) ./configure; make sqlite3.c

The only dependence for the above is having a "tclsh" somewhere on your $PATH.

-- 
D. Richard Hipp
d...@sqlite.org
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-21 Thread Jonathan Moules

So, I'm back to being stuck on this.
I have inserted 500,000 random urls (no extra lookups - still just 
1000), and now the query (as per the below reply) is back to being 
somewhat slow (I'm using sqlite 3.15) at about 0.6s per request.


After a couple of hours of investigation, it's only slow when there is 
either no ORDER BY, or if I use DESC (which is what I need). If I use 
ORDER BY u.url ASC - it's near instantaneous.


I've tried every possible combination of indexes I can think up, 
including of course with url_id DESC. I've also removed the ORDER BY in 
the view (and set it to DESC as well), but that made absolutely no 
difference.


I'm a little confused as to why I'm seeing this behaviour - my limited 
understanding of the query-planner and reading 
https://sqlite.org/queryplanner.html - suggests that at least when using 
indexes, it'll simply scan an index backwards if that looks like it'll 
help. I appreciate this is a result-set not an index, but in this case 
could not the query planner realise that reading through the results 
backwards would be faster than whatever it's doing?


And for that matter, shouldn't no ORDER BY be at least the same speed as 
ORDER BY u.url_id ASC?


Thoughts welcome; Thanks!
Jonathan


On 2018-03-19 00:24, Jonathan Moules wrote:

Thanks Simon and Quan.
I'm not sure it's the view itself per-se - It takes 0.000s (time too 
small to measure) for just the full View to be run on this dataset.


It turns out the problem is simpler than that and no data changes are 
needed. I did consider Quan Yong Zhai's option and gave it a try, but 
couldn't fathom out the necessary SQL to get what I wanted (it's 
getting late).


Instead prompted by the replies here, I've changed the query very 
slightly to the below which solves the problem:


SELECT
    u.url, l.error_code
FROM
    urls u
LEFT JOIN
    lookups l
    USING(url_id)
LEFT JOIN
    (select * from v_most_recent_lookup_per_url where url_id in (
        select url_id from urls where url = 'example.com'
    )) recent
    -- This is the important bit
    -- Here we use the most recent lookup url_id to link to the source_seed_id, so we only find its children
    -- Or alternatively itself
    ON u.source_seed_id = recent.url_id
    OR u.url_id = recent.url_id
WHERE
    -- For JSON-spider at least, Generic's are guaranteed to be generic pages.
    l.is_generic_flag = 1
    AND
    -- Must be "or equal to" so we can get the lookup of the very base url.
    l.retrieval_datetime >= recent.retrieval_datetime
    AND
    DATETIME(recent.retrieval_datetime) > DATETIME('now', '-5 days')
ORDER BY
    u.url_id DESC
LIMIT 1;


To save readers having to compare manually, the difference is this: I 
turned the "recent" alias item from the View into a subquery (still 
using the view), and then moved the "where url = example.com" part into 
it.
The query is now literally two orders of magnitude faster, from 0.2s 
to 0.004s. No new indexes or anything, just that change.


Hopefully this will scale to full datasets; if it doesn't I may have 
to consider the other suggestions, but for now this is a 
minimum-effort solution.


I'm not actually sure what SQLite was doing in the previous query to 
make it take so long, so I imagine there was some hideous recursing 
going on or something.


Scope for optimisation?

Thanks again,
Jonathan



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-18 Thread Jonathan Moules

Thanks Simon and Quan.
I'm not sure it's the view itself per-se - It takes 0.000s (time too 
small to measure) for just the full View to be run on this dataset.


It turns out the problem is simpler than that and no data changes are 
needed. I did consider Quan Yong Zhai's option and gave it a try, but 
couldn't fathom out the necessary SQL to get what I wanted (it's getting 
late).


Instead prompted by the replies here, I've changed the query very 
slightly to the below which solves the problem:


SELECT
    u.url, l.error_code
FROM
    urls u
LEFT JOIN
    lookups l
    USING(url_id)
LEFT JOIN
    (select * from v_most_recent_lookup_per_url where url_id in (
        select url_id from urls where url = 'example.com'
    )) recent
    -- This is the important bit
    -- Here we use the most recent lookup url_id to link to the source_seed_id, so we only find its children
    -- Or alternatively itself
    ON u.source_seed_id = recent.url_id
    OR u.url_id = recent.url_id
WHERE
    -- For JSON-spider at least, Generic's are guaranteed to be generic pages.
    l.is_generic_flag = 1
    AND
    -- Must be "or equal to" so we can get the lookup of the very base url.
    l.retrieval_datetime >= recent.retrieval_datetime
    AND
    DATETIME(recent.retrieval_datetime) > DATETIME('now', '-5 days')
ORDER BY
    u.url_id DESC
LIMIT 1;


To save readers having to compare manually, the difference is this: I 
turned the "recent" alias item from the View into a subquery (still 
using the view), and then moved the "where url = example.com" part into 
it.
The query is now literally two orders of magnitude faster, from 0.2s to 
0.004s. No new indexes or anything, just that change.


Hopefully this will scale to full datasets; if it doesn't I may have to 
consider the other suggestions, but for now this is a minimum-effort 
solution.


I'm not actually sure what SQLite was doing in the previous query to 
make it take so long, so I imagine there was some hideous recursing 
going on or something.


Scope for optimisation?

Thanks again,
Jonathan

On 2018-03-18 23:37, Simon Slavin wrote:

On 18 Mar 2018, at 11:13pm, Jonathan Moules  
wrote:


Given there's such a small amount of data at this point, I suspect the issue is 
more related to the recursion. I've tried creating these two indexes to 
facilitate that

Nice idea but I can see why it's not working.

You have an underlying problem: the format you're using to store your data makes it 
extremely difficult to extract the figures you want.  Quan Yong Zhai has the best idea 
I've seen: get rid of almost ¾ of the work you're doing by storing the last retrieval 
date in your "urls" table.

As an alternative, start your retrieval process by JOINing the two 
tables together.  Consider the result of this query

SELECT url_id, lookups.error_code
 FROM urls
 JOIN lookups ON lookups.url_id = urls.url_id AND lookups.retrieval_datetime 
= urls.retrieval_datetime
 WHERE DATETIME(urls.retrieval_datetime) > DATETIME('now', '-5 days')

and figure out what you would add to that to get your desired result.

Simon.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users




___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-18 Thread Simon Slavin
On 18 Mar 2018, at 11:13pm, Jonathan Moules  
wrote:

> Given there's such a small amount of data at this point, I suspect the issue 
> is more related to the recursion. I've tried creating these two indexes to 
> facilitate that

Nice idea but I can see why it's not working.

You have an underlying problem: the format you're using to store your data 
makes it extremely difficult to extract the figures you want.  Quan Yong Zhai 
has the best idea I've seen: get rid of almost ¾ of the work you're doing by 
storing the last retrieval date in your "urls" table.

As an alternative, start your retrieval process by JOINing the two 
tables together.  Consider the result of this query

SELECT url_id, lookups.error_code
FROM urls
JOIN lookups ON lookups.url_id = urls.url_id AND lookups.retrieval_datetime 
= urls.retrieval_datetime
WHERE DATETIME(urls.retrieval_datetime) > DATETIME('now', '-5 days')

and figure out what you would add to that to get your desired result.

Simon.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-18 Thread Jonathan Moules

Hi Both,
Thanks for your thoughts.

> SQLite has no DATETIME or BOOLEAN datatype

I'm aware that DATETIME (and BOOLEAN) isn't an actual datatype in 
SQLite, but it functions exactly as you'd expect - I've used them 
extensively in the past and never had a problem - I'm actually storing 
ISO8601 strings and so it's simply a synonym for TEXT - I find 
having human-readable datetimes infinitely more practical for 
development purposes. It's one of the things I most like about SQLite - 
datetimes are ridiculously easy compared to "proper" databases.
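
For example (a quick illustration rather than anything from my actual data), ISO8601 strings sort and compare correctly as plain text, which is why the date comparisons in the query behave as expected:

SELECT '2018-03-20 12:00:00' > '2018-03-18 21:48:00';   -- 1: a plain text comparison
SELECT DATETIME('2018-03-20 12:00:00', '-5 days');      -- '2018-03-15 12:00:00'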


> Drop the ORDER BY; it is useless in a view used in another query,
I figured sqlite would optimise it out if it didn't think it needed it 
(it's needed by some of the other things that use that view). A quick 
test seems to confirm it. It seems that the ORDER BY just becomes a 
"Noop" line (address 3) in the EXPLAIN report - otherwise the explain 
is identical.


> If SQLite thinks that an AUTOMATIC index is useful, you should 
consider creating it explicitly: CREATE INDEX lookup_generid_id_idx ON 
lookups(is_generic_flag, url_id);


It does make the EXPLAIN plan 6 lines shorter, but no apparent speed 
difference unfortunately.


Given there's such a small amount of data at this point, I suspect the 
issue is more related to the recursion. I've tried creating these two 
indexes to facilitate that:

CREATE INDEX url_id_source_id_idx ON urls (
    url_id,
    source_seed_id
);

CREATE INDEX source_id_url_id_idx ON urls (
    source_seed_id,
    url_id
);


SQLite will use the former (url_id, then source_seed_id), but it makes 
absolutely no difference to the speed.



So I'm still stumped.



On 2018-03-18 22:30, Tim Streater wrote:

On 18 Mar 2018, at 21:48, Jonathan Moules  wrote:


CREATE TABLE lookups (
  url_id   INTEGER  REFERENCES urls (url_id),
  retrieval_datetime   DATETIME,
  error_code   INTEGER,
  is_generic_flag  BOOLEAN  -- about one in 100 will have this flagged
);

SQLite has no DATETIME or BOOLEAN datatype, see:



Given this, I don't know what your attempts to compare date/times will do. 
Personally I store all date/times as seconds since the epoch; it simplifies 
life a lot.






___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-18 Thread Quan Yong Zhai
Drop the view, add a new column “last_retrieval_datetime” to the table urls, 
and whenever you fetch a page, update this column.
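
Something like this (a sketch; the UPDATE parameters are placeholders):

ALTER TABLE urls ADD COLUMN last_retrieval_datetime DATETIME;

-- after every fetch:
UPDATE urls
   SET last_retrieval_datetime = :retrieval_datetime
 WHERE url_id = :url_id;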

Sent from my iPhone

> On 19 Mar 2018, at 06:15, Clemens Ladisch  wrote:
> 
> I have not looked at the schema and queries in detail.
> But at a first glance:
> 
>> CREATE VIEW v_most_recent_lookup_per_url AS
>> ...
>> ORDER BY url_id;
> 
> Drop the ORDER BY; it is useless in a view used in another query,
> and just might slow things down.
> 
>> 011SEARCH TABLE lookups AS l USING AUTOMATIC COVERING INDEX 
>> (is_generic_flag=? AND url_id=?)
> 
> If SQLite thinks that an AUTOMATIC index is useful, you should consider
> creating it explicitly:
>  CREATE INDEX lookup_generid_id_idx ON lookups(is_generic_flag, url_id);
> 
> 
> Regards,
> Clemens
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-18 Thread Tim Streater
On 18 Mar 2018, at 21:48, Jonathan Moules  wrote:

> CREATE TABLE lookups (
>  url_id   INTEGER  REFERENCES urls (url_id),
>  retrieval_datetime   DATETIME,
>  error_code   INTEGER,
>  is_generic_flag  BOOLEAN  -- about one in 100 will have this flagged
> );

SQLite has no DATETIME or BOOLEAN datatype, see:

   

Given this, I don't know what your attempts to compare date/times will do. 
Personally I store all date/times as seconds since the epoch; it simplifies 
life a lot.
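
For instance (a sketch using the lookups table from this thread; the values are made up):

-- store seconds since the epoch
INSERT INTO lookups (url_id, retrieval_datetime, error_code, is_generic_flag)
VALUES (1, strftime('%s', 'now'), 200, 0);

-- "within the last 5 days" then becomes plain integer arithmetic
SELECT * FROM lookups
 WHERE retrieval_datetime > strftime('%s', 'now') - 5 * 86400;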



-- 
Cheers  --  Tim
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How to optimise a somewhat-recursive query?

2018-03-18 Thread Clemens Ladisch
I have not looked at the schema and queries in detail.
But at a first glance:

> CREATE VIEW v_most_recent_lookup_per_url AS
>  ...
>  ORDER BY url_id;

Drop the ORDER BY; it is useless in a view used in another query,
and just might slow things down.

> 011SEARCH TABLE lookups AS l USING AUTOMATIC COVERING INDEX 
> (is_generic_flag=? AND url_id=?)

If SQLite thinks that an AUTOMATIC index is useful, you should consider
creating it explicitly:
  CREATE INDEX lookup_generid_id_idx ON lookups(is_generic_flag, url_id);


Regards,
Clemens
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] How to optimise a somewhat-recursive query?

2018-03-18 Thread Jonathan Moules

Hi List,
I have a query that's quite slow on a small sample of data and I'd like 
to optimise it before larger datasets get in there. With just 1000 rows 
in each table I'm currently seeing query times of ~0.2 seconds. My 
attempts to optimise it so far by putting in carefully thought out 
indexes have _increased_ the time to 0.4s! Hmm, so much for "carefully 
thought out". So I've removed all the indexes again and am asking the 
experts here for some guidance.


The reduced table structure is as below. It's for some web-crawling - 
basically each URL will be stored once in the "urls" table. It also 
keeps a reference to the original source url that ended up leading to 
this url (that's source_seed_id - it's used in this query), and to the 
direct parent (parent_id).
Then any time the application scrapes one of the url's, an entry is 
written into the lookups table.


The idea behind the query is to see whether during the last crawl the 
application got to the last page in the paginated "chain" (it will have 
a specific error code which the application is looking for). This means 
it's somewhat recursive.

SQLite IDE version (where the timings are coming from): 3.15.0
SQLite version behind the application itself (I've not timed the query 
there, but it's still slow): 3.22.0



CREATE TABLE urls (
    url_id          INTEGER  PRIMARY KEY AUTOINCREMENT,
    url             TEXT     UNIQUE,
    parent_id       INTEGER  REFERENCES urls (url_id),
    source_seed_id  INTEGER  REFERENCES urls (url_id)
);


CREATE TABLE lookups (
    url_id               INTEGER  REFERENCES urls (url_id),
    retrieval_datetime   DATETIME,
    error_code           INTEGER,
    is_generic_flag      BOOLEAN  -- about one in 100 will have this flagged
);


-- This view is used as a convenience, including in the problem query below

CREATE VIEW v_most_recent_lookup_per_url AS
SELECT url_id,
   MAX(retrieval_datetime) AS retrieval_datetime
  FROM lookups
 GROUP BY url_id
 ORDER BY url_id;


The query itself:


SELECT
    u.url, l.error_code
FROM
    urls u
LEFT JOIN
    lookups l
    USING(url_id)
LEFT JOIN
    v_most_recent_lookup_per_url recent
    -- This is the important bit
    -- Here we use the most recent lookup url_id to link to the source_seed_id, so we only find its children
    -- Or alternatively itself
    ON u.source_seed_id = recent.url_id
    OR u.url_id = recent.url_id
WHERE
    l.is_generic_flag = 1
    AND
    -- Must be "or equal to" so we can get the lookup of the very base url.
    l.retrieval_datetime >= recent.retrieval_datetime
    AND
    DATETIME(recent.retrieval_datetime) > DATETIME('now', '-5 days')
    AND
    recent.url_id in (
        select url_id from urls where url = 'example.com'
    )
    -- By definition url's can only be inserted in pagination order, so we can order by url_id.
ORDER BY
    u.url_id DESC
LIMIT 1;



The current EXPLAIN QUERY PLAN:

1|0|0|SCAN TABLE lookups
1|0|0|USE TEMP B-TREE FOR GROUP BY
0|0|0|SCAN TABLE urls AS u
0|1|1|SEARCH TABLE lookups AS l USING AUTOMATIC COVERING INDEX (is_generic_flag=? AND url_id=?)
0|2|2|SCAN SUBQUERY 1 AS recent
0|0|0|EXECUTE LIST SUBQUERY 2
2|0|0|SEARCH TABLE urls USING COVERING INDEX sqlite_autoindex_urls_1 (url=?)


And the EXPLAIN:


[The full EXPLAIN opcode listing was pasted here, but its operand columns have been run together in the archive and the listing is cut off partway through (it runs from opcode 0, Init, to opcode 37, Integer).]