Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?

2018-01-09 Thread Wade, William
If you were going to do this entirely in memory (perhaps in C, or some similar 
language), you would likely use some tree structure where each node keeps track 
of the number of descendants (direct and indirect) of that node. That allows 
the operations you describe to occur in O(log(N)) time. Single-record 
insert/delete/update has the same time complexity.

It is likely that for your tree, every node would have the same structure 
(struct in C), or else every internal node would have one structure, and every 
leaf node would have another structure.

Now given a bunch of objects with the same structure, you can easily store them 
in a relational database, rather than in memory, and perform similar operations 
on them. A collection of struct instances turns into a table, and a C pointer 
turns into a row-id (or similar). This isn't entirely free, of course. In C we 
think of a pointer dereference as occurring in constant time, while in a 
database, a key lookup is typically log(N) time, but still, your log(N) 
in-memory solution becomes a log-squared(N) database solution, and that is 
usually fast enough.
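For illustration, a minimal sketch of such a "counting" node stored as a table
row (the table and column names here are invented):

   CREATE TABLE TREE_NODE (
  NODE_ID       INTEGER PRIMARY KEY,  -- plays the role of the C pointer
  PARENT_ID     INTEGER,              -- NULL for the root
  LEFT_ID       INTEGER,              -- child "pointers" are row-ids
  RIGHT_ID      INTEGER,
  KEY           INTEGER,
  VALUE         FLOAT,
  SUBTREE_SIZE  INTEGER               -- 1 + number of descendants
   );

Finding the row with a given rank is then a walk down from the root, using
SUBTREE_SIZE to decide which child to descend into, so only O(log(N)) rows are
read, each at O(log(N)) lookup cost.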

Of course you lose some of the database "convenience." You're essentially
implementing trees which are close to those that are already "free" in sqlite.
Likewise, some simple SQL queries turn into something more complex (since you 
need to maintain your tree). At least you still get the ACID benefits.

If I google "counting tree in sqlite" I see some hits that, perhaps, already do 
this kind of thing.

Regards,
Bill

-Original Message-
From: Eric Grange [mailto:zar...@gmail.com]
Sent: Tuesday, January 9, 2018 3:51
To: General Discussion of SQLite Database 
Subject: [sqlite] Efficient ways to maintaining order rank / row_number() in a 
rather large set ?

Hi,

I have a problem where I have a large set of (key, value) which I want to sort 
by value, and then store in a table having (rank, key, value) fields, so that 
for a given key I can quickly find the rank, or for a given rank range, I can 
quickly list the keys & values.

Since there is no ROW_NUMBER() function, but there is an autoincrement feature, 
and the ranks are numbered 1, 2, 3 etc., the strategy I have been using is to 
create a RANKED table like

   CREATE TABLE RANKED (
  RANK INTEGER PRIMARY KEY AUTOINCREMENT,
  KEY INTEGER,
  VALUE FLOAT
   )

(+ an index for the key)
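For concreteness, that index and the two lookups it serves might look like
(a sketch; statement details are illustrative):

   CREATE INDEX RANKED_KEY ON RANKED (KEY);

   -- rank of a given key
   SELECT RANK FROM RANKED WHERE KEY = :key;

   -- keys & values for a rank range
   SELECT KEY, VALUE FROM RANKED WHERE RANK BETWEEN :lo AND :hi;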

and then I fill that table with something like

   INSERT INTO RANKED (KEY, VALUE)
  SELECT key, value
  FROM ...something rather complex and big...
  ORDER BY value desc

This works well enough, but as the amount of values to be ranked increases, it 
feels wasteful to delete everything and then re-insert everything just to 
adjust the RANK column. I am also running into memory issues, as the ORDER BY 
requires a temporary b-tree which runs into the gigabyte range in some 
instances.

I have ways to maintain the KEY and VALUE columns individually and incrementally, 
but the approaches I have tried for maintaining the RANK with UPDATE queries ran 
much slower than deleting and recreating everything, though this could just be 
bad implementation on my part.

Are there any other strategies I could use that could update just the RANK 
field and mitigate the temporary B-tree size?

Eric




Re: [sqlite] Best way to wipe out data of a closed database

2016-10-24 Thread Wade, William
It sounds like you've got a way forward on leaks via the malloc() system within 
the process space.

Be aware that depending on your system (and the attackers' capabilities), you 
might have to worry about other leaks. For instance, if I did a query that 
involved a FLOAT index, and then closed the sqlite session and cleared the 
zero-malloc block, it wouldn't be at all surprising if some float values 
related to database contents, or to my query, were still in

1) The region of the C process stack that was reached by some deep call stack.
2) Processor registers.
3) Process pages that were copied to disk by the OS (this could be a problem 
even if you otherwise have a good handle on clearing malloc() blocks).

Regards,
Bill

-Original Message-
From: Max Vlasov [mailto:max.vla...@gmail.com]
Subject: [sqlite] Best way to wipe out data of a closed database

Hi,

In an application that implements encryption/decryption with a VFS, what is the 
best way to ensure that the memory of the application doesn't contain decrypted 
data after the database is closed, so that no other application could retrieve 
sensitive information by reading this process memory? This concerns not only the 
database as a whole but also fragments of database sectors anywhere in the 
process memory space.

One possible trick is to add extra zeroing to the global free handler, but this 
could introduce performance penalties.

Is there any other way to do this?

Thanks,

Max




Re: [sqlite] Risk of damaging a non-sqlite file with `sqlite3 `?

2016-10-21 Thread Wade, William
> From: Clemens Ladisch [mailto:clem...@ladisch.de]
> Sent: Friday, October 21, 2016 1:31 AM
> It would be possible to open a 'wrong' file only if someone had created a 
> fake database file deliberately.

I think that is generally correct, but "possible ... only if" is perhaps a bit 
strong.

sqlite3.exe is willing to open and modify an existing empty file. In some sense 
that means that all empty files are sqlite databases. Often, the creator of the 
file had something else in mind.

If I run the unix command

split --bytes=360K foo.sqlite foo.sqlite

so that I can archive the file onto a collection of floppy disks, one of the 
files I get will be named foo.sqliteaa, and that file will have the sqlite 
header string. I'd call it a stretch to say that I deliberately created a fake 
database file.

Regards,
Bill



Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-04 Thread Wade, William
A lot of speculation here. I am certainly no SQLite expert.

Your input has 1g positions, taking 187gb, so averaging 187b/position. From 
your CREATE TABLE, it looks like, to get that size, most of your qfrom and qto 
are fairly long strings. I'm assuming there are a great many duplications in 
those positions. If so, put them into a separate position table { positionId 
INT, positionName TEXT }, with positionId as the primary key and positionName 
also being unique. This will be even more useful if you have a fixed set of 
possible positions, and you make it so that positionId is increasing whenever 
positionName is increasing. In your qmoves table, store positionId values, 
rather than positionName values. Saves a lot of space because no name is in the 
database more than once, and most of your space is related to names.
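A sketch of that layout (the qfrom/qto columns are assumed to now hold ids; the
rest of the names are invented):

   CREATE TABLE position (
  positionId    INTEGER PRIMARY KEY,
  positionName  TEXT UNIQUE NOT NULL
   );

   -- qmoves rows then hold small integers instead of long strings, and
   -- queries that need the names join back to position:
   SELECT p1.positionName, p2.positionName
 FROM qmoves m
 JOIN position p1 ON p1.positionId = m.qfrom
 JOIN position p2 ON p2.positionId = m.qto;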

Space is important because, at a few hundred gb, your database is probably 
bigger than all of your available fast caches, and you are probably storing 
your data on rotating storage. Writes to random positions might easily average 
10ms, and the import of most of your records may involve one or more such 
writes by the time indices are updated. Reducing sizes means fewer such writes, 
because things are more likely to fit in the various caches.

I believe that with SQLite, if you don't specify WITHOUT ROWID your "real" 
record order is based on rowid, so complete records are, perhaps, stored in 
import order. The PRIMARY KEY you specified is large (a name and an integer), 
so the index entries are likely about half as large as a complete record, and 
they don't fit in cache either. It is also likely that your input data was not 
in sorted order. That means that adding an entry to that index likely involves 
a write to a random position. 10ms * 1g = 1e7 s, or roughly four months.

Your qmoves_by_dest index is about as large as your primary (it also has a 
position and an integer), so similar timings might apply.

I suggest timing your script on smaller inputs (1gb, 10gb, 20gb) and 
seeing if there is a size where things suddenly get worse (available caches are 
not big enough). See if my suggestions help those timings. See if WITHOUT ROWID 
helps those timings.

In principle, indices can be created by writing the needed information (index 
key, record position) in the original order, and then sorting that into 
key-order. That can be done with far fewer random seeks (merge sorts involve 
mostly sequential reads and writes). I don't know if, or when, SQLite does that.

Regards,
Bill

-Original Message-
From: Kevin O'Gorman [mailto:kevinogorm...@gmail.com]
Sent: Wednesday, August 03, 2016 10:00 PM
To: sqlite-users
Subject: [sqlite] newbie has waited days for a DB build to complete. what's up 
with this.

I'm working on a hobby project, but the data has gotten a bit out of hand.
I thought I'd put it in a real database rather than flat ASCII files.

I've got a problem set of about 1 billion game positions and 187GB to work on 
(no, I won't have to solve them all) that took about 4 hours for a generator 
program just to write.  I wrote code to turn them into something SQLite could 
import.  Actually, it's import, build a non-primary index, and alter table to 
add a column, all in sqlite3.

The database was growing for about 1-1/2 days.  Then its journal disappeared, 
the file size dropped to zero, but sqlite3 is still running 100% CPU time, now 
for a total of 3800+ minutes (63+ hours).  The database is still locked, but I 
have no idea what sqlite3 is doing, or if it will ever stop.  All partitions 
still have lots of space left (most of this is running in a RAID partition of 
11 TiB).  Here's what I gave to sqlite3 on my Linux system:

time sqlite3 qubic.db 

Re: [sqlite] The upcoming "pi" release of SQLite

2016-07-25 Thread Wade, William
If too late for this release, consider it a request for the next release:

Provide a way (pragma, or additional syntax on the command/virtual table) to 
have the command-line .import, and the new CSV virtual table, use NUMERIC 
column affinity when creating a new table.

I like to think that 10 is greater than 2. Unfortunately .import creates a 
table where that is not true.
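The effect is easy to see in the shell (illustration only):

   SELECT 10 > 2;       -- 1: compared as numbers
   SELECT '10' > '2';   -- 0: compared as text, the way imported values behave

NUMERIC affinity on the created columns would convert such text to numbers on
insert and give the first result instead.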

Thanks,
Bill

-Original Message-
The next release of SQLite will be the "pi" release - version 3.14.




[sqlite] Incremental backup/sync facility?

2016-05-09 Thread Wade, William
> From: Stephan Buchert [mailto:stephanb007 at gmail.com]
> Sent: Saturday, May 07, 2016 12:10 AM

> Copying the WAL files is probably more efficient than the SQL text solutions 
> (considering that roughly 5 GB of binary data are weekly added), and it seems 
> easy to implement, so I'll probably try this first. I guess that simply 
> opening the primary database for read before starting the insert/update 
> process would prevent WAL checkpointing until I have a chance to copy the WAL 
> (and SHM) files?

It sounds like that would work. You would need to be careful with your process. 
At a minimum, you need to be sure that full checkpoints don't occur, except in 
conjunction with your copies. It would probably be very easy to do an 
"accidental" checkpoint simply running the sqlite3 command-line tool after a 
reboot. I believe the no-auto-checkpoint setting is per-application, not 
per-database.
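For what it's worth, the relevant knobs look something like this (a sketch;
wal_autocheckpoint is a per-connection setting):

   PRAGMA journal_mode = WAL;
   PRAGMA wal_autocheckpoint = 0;     -- no automatic checkpoints
   -- ... after the backup step has copied the -wal file ...
   PRAGMA wal_checkpoint(TRUNCATE);   -- checkpoint and reset the WAL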

I suppose you could change the checkpoint-code so that after a full checkpoint, 
the WAL file is renamed (with a counter suffix), rather than deleted. Your 
incremental backup would just need to copy (and perhaps delete) all of the 
renamed WAL files, and process them, in order. That would have the advantage 
that you don't need to let the WAL file grow so large, and you don't have to do 
anything to the main database while you are doing the incremental backup. 
You'd still have to be careful with your process. Run some sqlite program that 
doesn't have your "changed" checkpoint-code, and you might lose a WAL file.

Regards,
Bill




[sqlite] Random-access sequences

2016-03-01 Thread Wade, William
In RAM, the simple implementation would be to have a balanced tree, ordered by 
index, where every node knows how many elements are below it, and has pointers 
to its children. The tree will have O(logN) depth, following a pointer is O(1), 
and, and all of your operations involve a small (constant-bound) number of 
nodes at each level.

In a typical indexed database, following a pointer (finding the row for a key) 
takes O(logN) time (rather than O(1) time), and it seems most of your 
operations would now cost O((logN)^2) time.

You could hack your database's B-tree implementation, so that each B-Tree page 
"knows" the number of records "below" the page. At that point you would 
essentially be putting the RAM implementation into a file (with an optimistic 
assumption that Read(file, page#) is a constant-time operation in the file 
system).

In either of the above, auto-increment works fine as long as the total number 
enqueued() over the lifetime of the database is not too large, but in addition 
to the autoincrement, you need to update the "parents" of inserted or removed 
records.

Regards

-Original Message-
From: Matthias-Christian Ott [mailto:o...@mirix.org]
Sent: Tuesday, March 01, 2016 10:38 AM
To: sqlite-users at mailinglists.sqlite.org
Subject: Re: [sqlite] Random-access sequences

On 01/03/16 11:59, Matthias-Christian Ott wrote:
> I know that this question is not strictly related to SQLite.
>
> I want to persist a random-access sequence with SQLite. A
> random-access sequence is a queue with random-access to its queued
> elements. Suppose that a random-access sequence [x_0, ..., x_n] has the 
> following operations:
>
> enqueue([x_0, ..., x_n], x) = [x_0, ..., x_n, x]
> dequeue([x_0, ..., x_n]) = [x_1, ..., x_n]
> lookup([x_0, ..., x_n], i) = x_i
> update([x_0, ..., x_n], i, y) = [x_0, ..., x_(i - 1), y, x_(i + 1), ..., x_n]
> length([x_0, ..., x_n]) = n + 1

Thinking a bit more about it, I also need the following operation:

delete([x_0, ..., x_n], i) = [x_0, ..., x_(i - 1), x_(i + 1), ... x_n]

So implementing the data structure with autoincrement does not work anymore.

- Matthias-Christian





[sqlite] IS a SQLite db of small size as good as reliable cache?

2016-02-01 Thread Wade, William
For cache eviction, below are you referring to writing dirty pages to disk?

If you are talking about cached reads, the evidence seems to point to many such 
pages remaining in some kind of cache.

I perform a recursive grep (Cygwin) on a large directory tree. It takes 4.5 
minutes. If I wait five minutes and do it again, it takes 0.6 minutes. It is 
possible that I am 
measuring something else (caching at the disk-driver level, or lower), but for 
the OP's question, that may be good enough.

Win7, 1tb Seagate scsi drive, 5500 rpm. System has 32gb of ram (c: drive is a 
256gb ssd).

Regards,
Bill

-Original Message-
From: Howard Chu [mailto:h...@symas.com]
Sent: Saturday, January 30, 2016 4:23 PM
To: SQLite mailing list
Subject: Re: [sqlite] IS a SQLite db of small size as good as reliable cache?

The Windows cache manager is quite aggressive at evicting cached pages from 
RAM. It used to be tunable back in Win2000, but none of those tuning knobs 
survived past WinXP. Generally, if you access some data, leave it for more than 
5-10 seconds, don't expect to be able to reference it again without incurring a 
hard page fault.

Note that the cache eviction runs quite frequently - once every 5 seconds or 
so, and evicts pages regardless of whether there's any memory pressure in the 
system. It's quite possibly the stupidest cache manager ever written.




[sqlite] INSERT OR IGNORE consuming too many rowids

2015-12-09 Thread Wade, William
I may be reading different documentation (or perhaps just misreading it).

https://www.sqlite.org/autoinc.html says (for the no AUTOINCREMENT case, which 
applies to the original post) "If no ROWID is specified on the insert ... [the] 
usual algorithm is to give the newly created row a ROWID that is one larger 
than the largest ROWID in the table prior to the insert."

And goes on to strongly imply that "not usual" means either the table is 
currently empty, or the largest ROWID is already at the largest possible 
integer. The implication is pretty strong that leaving ROWID unspecified on an 
insert is equivalent (in the usual case) to specifying MAX(ROWID) + 1.
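That reading is easy to check from the shell (no AUTOINCREMENT involved):

   CREATE TABLE t(x);
   INSERT INTO t(x) VALUES('a');   -- gets rowid 1
   INSERT INTO t(x) VALUES('b');   -- gets rowid 2
   INSERT INTO t(x) VALUES('c');   -- gets rowid 3, i.e. MAX(rowid) + 1
   SELECT rowid, x FROM t;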

I understand the difference between "documenting some details about how things 
currently happen to behave" and "documenting the intended behavior." When I 
read about the -journal or -wal files, I'm usually happy with the thought that 
in the future SQLite may support (or use exclusively) an entirely different 
rollback mechanism. It isn't entirely obvious which category autoinc.html falls 
into, but I believe that a reasonable reader can interpret that as documenting 
intended behavior.

Best Regards,
Bill

-Original Message-
From: Richard Hipp [mailto:d...@sqlite.org]
Sent: Wednesday, December 09, 2015 7:07 AM
To: SQLite mailing list
Subject: Re: [sqlite] INSERT OR IGNORE consuming too many rowids

As others have pointed out, though, the "desired" behavior is "undocumented". 
Neither SQLite nor any other SQL database engine is obligated to provide the behavior 
you want.  It happens as you like purely by chance.  Do not depend on this 
behavior since it might change at any moment, without warning.




[sqlite] Non-transitive numeric equality

2015-11-06 Thread Wade, William
I have an engineering application, where double can be part of a key. In cases 
where I do a comparison that implies a test for equality, I don't necessarily 
care what happens in the really close cases.

SELECT x,y FROM mytable WHERE x >= 1234 AND x < 5678

There may be about 10000 records in that time range, and for engineering 
purposes when I write that I don't particularly care whether I am getting 
10000, 10001 or 10002 entries.

However I do depend on getting consistent answers, and on (x >= 1234) being 
disjoint from (x < 1234) and that those two ranges cover the number line. I 
suspect that sqlite makes that kind of guarantee, at least if I explicitly cast 
values to double whenever I do an INSERT or UPDATE or WHERE, although that 
seems like an easy thing to forget to do.

However, nothing in the sqlite documentation promises that sqlite won't 
internally perform an equality test on primary keys, so it seems that I have to 
worry that if equality is not transitive, my database can be in arbitrarily bad 
shape (select gives answers that aren't even close to looking right).

Is the answer for this kind of thing to cast all of my values to double when 
feeding them to sqlite, if I want it to do math using "double" rules?
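For illustration, one way to avoid having to remember the cast on every
statement is to give the column REAL affinity, so values are coerced on the way
in (a sketch, not something the documentation prescribes):

   CREATE TABLE mytable (
  x  REAL,   -- REAL affinity: integer values are stored as doubles
  y
   );
   INSERT INTO mytable (x, y) VALUES (1234, 'example');  -- x becomes 1234.0
   SELECT x, y FROM mytable WHERE x >= 1234 AND x < 5678;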

I suspect sqlite could get the "correct" answers when doing comparisons between 
two types where one type is not a superset of the other by converting both 
values to bigint rationals and comparing those (I think the sqlite numeric 
types are all representable as bigint rationals). That may be overkill (and not 
all that lite).

Regards

-Original Message-
From: Richard Hipp [mailto:d...@sqlite.org]
Sent: Thursday, November 05, 2015 8:55 AM
To: SQLite mailing list
Subject: Re: [sqlite] Non-transitive numeric equality

On 11/5/15, Zsbán Ambrus wrote:
>... It seems that equality of numeric values isn't transitive, when both
> integers and reals are involved...

You should not compare floating-point numbers for equality.
Floating-point numbers are, by definition, approximations.  When you compare 
floating-point numbers, therefore, you get an approximate answer.

--
D. Richard Hipp
drh at sqlite.org




[sqlite] How would sqlite read this from disk?

2015-10-29 Thread Wade, William


[sqlite] Any database unique ID across multiple connections ?

2015-09-25 Thread Wade, William
It depends a lot on what kind of duplications you are concerned with. For 
instance, a file-system copy would, presumably, have copied any "unique 
identifier" that may have been in the file, so it will no longer be unique. To 
detect that kind of duplication, I think you really need support at the VFS 
level.

It sounds like you believe that you have a file-system path to both databases. 
If that is the case, you can probably find a way to put a temporary 
"fingerprint" on one file that is unlikely to be "copied" to another file. For 
instance on windows (at least with a situation where LockFileEx() works as 
advertised) I believe you could:

1) Open the first file (read-only works on windows)
2) Select 129 byte addresses that are unlikely to be in use (somewhere near the 
peta-byte address). On Windows the file doesn't have to be that big, and you 
wouldn't want to use this approach if it were.
3) Get an exclusive lock on one of those bytes to indicate that you are doing a 
fingerprint test.
4) Generate a GUID, and come up with a mapping from each bit of the guid to one 
of the other 128 bytes.
5) For each "set" bit of the GUID, gain an exclusive lock on the corresponding 
byte of the file.
6) For each "unset" bit of the GUID, gain a shared lock on the corresponding 
byte of the file.
7) Open the second file.
8) Attempt (non-blocking) to gain exclusive and shared locks on the same 129 
byte addresses in the second file.
9) If any attempt to gain an exclusive lock (2nd file) succeeds, the second 
file is different from the first file.
10) If any attempt to gain a shared lock (2nd file) succeeds where the first 
file has an exclusive lock, the files are different.
11) If any attempt to gain a shared lock (2nd file) fails, where the first file 
has a shared lock, the files are different.
12) Otherwise, the files are probably the same.
13) Release all of your locks.

This could fail if somebody (possibly malicious) writes and uses a 
copy-lock-pattern tool. Something in the first six steps could fail if somebody 
else was using those bytes for some reason, but if the first six steps succeed, 
it seems unlikely that the remaining steps would "accidentally" fail if the 
underlying file-system and GUID generation are robust.

You could use fewer locks in a larger region to represent a guid (for instance, 
instead of mapping 128 GUID bits to 128 file bytes, map 16 GUID bytes to 16x256 
file bytes).

Regards,
Bill

-Original Message-
From: ALBERT Aurélien [mailto:aurelien.alb...@alyotech.fr]
Sent: Thursday, September 24, 2015 4:42 AM
To: 'sqlite-users at mailinglists.sqlite.org'
Subject: [sqlite] Any database unique ID across multiple connections ?

Hi,

I'm using the SQLite C API and my application uses multiple databases, each database 
having multiple connections.

I need to identify which connections use the same database as another connection.

For the moment, I use the absolute database filename as a "unique database 
identifier", and I store this information with my "sqlite3*" connection handle.

Is there any simpler/safer way to get a unique database identifier across 
multiple connections? (For example, if one day I need the same for ":memory:" 
databases, bad things are going to happen.)

Maybe there is already a database identifier in the SQLite API and I missed it?

Thanks.





[sqlite] Read data from database slow the first time but then very fast thereafter. How can I improve the speed for the first time read?

2015-08-06 Thread Wade, William
I'm an SQLite novice, so if I'm wrong about this, I hope someone will correct 
me.

The first query will be slower than later queries for the reason Simon 
mentioned. However, it may still be possible to get some speed up of the first 
query, if it can avoid reading stuff not in the result set and also avoid lots 
of wide-spread random seeks. If your 1 million records are the entire database, 
none of these suggestions will help much.

If you "own" the database (can perform writes, or even change the schema) you 
may be able to change it to a form which reads faster on the first attempt. All 
of these are most likely to help if not a lot of changes have occurred since 
the last VACUUM.

For instance, if your 1 million records are an entire table, but are also only 
1% of the volume of the database as a whole, a VACUUMed database may be faster.

In addition, if your 1 million records are each small, and extracted from a 
table of 100 million records, then having your table's primary key such that 
the result set comes from a contiguous range of the original table may help. 
The documentation says that VACUUM makes each table contiguous. It doesn't 
explicitly say that it makes them contiguous in primary-key order, so I don't 
know for sure that this would help (but I suspect it does).

Finally, if the table has substantially more columns than you are asking for, 
splitting the table into two tables (the stuff you want, and the stuff you 
don't want) may help.
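A sketch of that kind of split (all names invented for illustration):

   -- frequently-read columns stay in a narrow table ...
   CREATE TABLE readings_hot (
  id     INTEGER PRIMARY KEY,
  key    INTEGER,
  value  FLOAT
   );
   -- ... and wide, rarely-read columns move to a side table with the same id
   CREATE TABLE readings_cold (
  id     INTEGER PRIMARY KEY,
  notes  TEXT,
  extra  BLOB
   );

The common query then touches far fewer pages, because each page of the narrow
table holds many more rows.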

If nothing else, VACUUM is likely to shrink the database, which speeds loading 
it from disk.

Bill

> From: Simon Slavin [mailto:slavins at bigfraud.org]
> Sent: Tuesday, August 04, 2015 10:46 AM

> On 3 Aug 2015, at 1:58pm, Linquan Bai  wrote:

>> I am trying to read large data from the database about 1 million records.
>> It takes around 1min for the first time read. But if I do the same
>> process thereafter, the time is significantly reduced to 3 seconds.
>> How can I get a fast speed for the first time read?

> You can't.  Some part of your computer has pulled that data into cache, and 
> it's still in the cache when you run the process again, so the data 
> doesn't need to be fetched from disk again.

> Simon.



[sqlite] Replace an open database

2015-05-27 Thread Wade, William
I have an application file that I'm considering moving to sqlite. Some current 
behavior is:

1) Only one writing thread for the lifetime of a logical dataset.

2) The writing thread opens the file with exclusive write access (open fails if 
there is another writer, no other writers allowed, but one writer plus multiple 
readers is a common situation).

3) The writing thread prefers to delete any existing file. If it can't do that 
(some readers currently have the file open) it gains an exclusive read/write 
lock (consistent with no reader having a transaction in progress) and truncates 
the file to zero length, writes its new header (including its own uuid, 
indicating that this is logically a new file). When existing readers get around 
to reading again, they will check that uuid, and handle the change in writers 
"gracefully."

I'm wondering how to implement that behavior while using sqlite as my 
application file. I can imagine several solutions, but I'm not sure what 
pitfalls to look for. In particular, I haven't determined how to get sqlite to 
write-open an existing empty file, as if it were creating a new database.

Possible solutions:

1) As part of the initial write transaction (at the sqlite client level), use 
sqlite to remove all of the existing tables. There are a couple of issues with 
this. If the old file was 200gb, I'd expect sqlite to "waste a lot of time" 
updating its free-page structures. Also, if the new file only grows to a few 
megabytes, I'd want the excess space to be recovered. I also want the 
write-open to succeed if the existing file is corrupted.
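For reference, a sketch of what option 1 could look like in SQL (table names
invented); the VACUUM at the end is what actually gives the space back, at the
cost of rewriting the file:

   BEGIN IMMEDIATE;
   DROP TABLE IF EXISTS old_payload;
   DROP TABLE IF EXISTS old_metadata;
   COMMIT;
   VACUUM;   -- rebuild the file so the freed pages are released to the OS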

2) Implement my own VFS, such that old readers get read (and lock) errors until 
they open a new connection, while my new writer "sees" this as a new file even 
if the VFS wasn't able to delete the file at the OS level. Since I'm just 
getting started with sqlite, I'd rather not have to dive into the low-level 
guts of implementing a new VFS, if I can help it.

3) After gaining the exclusive lock, truncate the database file (and the 
write-ahead log or rollback journal), create a new database file (somewhere 
else, possibly a virtual database). Perform a raw copy of the new virtual 
database into the old actual database file, toss the virtual database, and then 
use sqlite to open the old (now properly formatted, almost empty) database file 
as a writer.

4) Change all readers so that they close/reopen the database at all transaction 
boundaries (so that when there are no current read transactions, nobody else 
has the file open). I'm a bit worried about the performance implications of 
this for small read transactions (although I have not tested this). I can do 
this for all my existing clients, but I can't do this for somebody using some 
other sqlite3 client (like the command-line client). Also if some non-sqlite 
code has the file open without a lock (perhaps somebody is currently copying 
the file using an os-level command), I'd prefer to make that reader fail, 
rather than delay my writer.

Help please?
Bill
