On 4/5/2014 12:33 PM, da...@dandymadeproductions.com wrote:
> The assumption is that the networked database, datasource, could be on
> the local LAN or Internet.

So am I to understand that you indeed are concerned (1) that the pipe
could be slow, and (2) that the server may be heavily loaded?  Alright.

> The 'snapshot' would not necessarily be everything, but based on a
> SELECT statement of a set of the datasource content.

Okay, that's good.

> The benefit I see from this local file/memory database is that I have
> found some processing of data for analysis occurs over and over to
> derive comparison results. By having the data local the user can
> perform these analyses without constantly re-querying the production
> database.

That makes sense, though there are costs.  But you say...

> It is assumed that the user knows that data can be stale at any point
> beyond the initial load.

Okay, also good that this limitation is known and accepted.

> The analysis tools can also remain unchanged since the data is still
> coming from an RDBMS.

Yes, good point.  This is definitely a massive plus over rolling your
own caching mechanisms.

> The only reason the queue was considered is because it is quite
> conceivable that a network datasource would be the choke point, so a
> queue being filled by several threads in process 1 would speed up the
> population.

I'm trying to understand your statement because at first glance it seems
contradictory.  You assert the data source may be slow, but the solution
you present is normally used when interfacing with a slow data *sink*.

My best guess is you're considering running multiple source queries
simultaneously to better utilize (monopolize, really) the network, and
this design indeed typically needs a queue to serialize its output so
it can be fed to a sink that accepts only one thing at a time.
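
Concretely, that pattern looks something like this sketch (hypothetical
names throughout; the producers' actual query logic is elided):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class QueuedLoader {
        static final Object DONE = new Object();  // sentinel: a producer finished

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Object> queue = new LinkedBlockingQueue<>(10_000);
            int producers = 4;
            for (int i = 0; i < producers; i++) {
                new Thread(() -> {
                    // Run this thread's subquery against the remote source
                    // and queue.put(row) for each row received (elided).
                    try {
                        queue.put(DONE);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }).start();
            }
            int finished = 0;
            while (finished < producers) {  // single consumer: the sink
                Object item = queue.take(); // sees one thing at a time
                if (item == DONE) {
                    finished++;
                } else {
                    // INSERT item into SQLite here
                }
            }
        }
    }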

I have two answers for you.  Pick whichever you like.  I apologize in
advance for the level of detail in answer #2.  I just want to be clear
so you can make the right decision.  Also I can't help but plug the
Wibble web server (featuring Tcl coroutines) since it's my baby.

Answer 1: Keep it simple.  Have one connection only, and just write
whatever you receive as you get it.  SQLite will not be your bottleneck.
One, it's very fast.  Two, you already say the source is the choke
point.  No need to complicate things.
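
For concreteness, here is roughly what answer 1 looks like in Java with
JDBC on both ends.  The URLs, query, and table names are placeholders,
and wrapping the whole load in one transaction keeps SQLite from
syncing to disk on every row:

    import java.sql.*;

    public class SnapshotCopy {
        public static void main(String[] args) throws SQLException {
            try (Connection src = DriverManager.getConnection(
                     "jdbc:mysql://remote.example.com/db", "user", "pass");
                 Connection dst = DriverManager.getConnection(
                     "jdbc:sqlite:snapshot.db")) {
                try (Statement s = dst.createStatement()) {
                    s.execute("CREATE TABLE IF NOT EXISTS local_table"
                            + " (id INTEGER, name TEXT, value REAL)");
                }
                dst.setAutoCommit(false);   // one transaction for the whole load
                try (Statement q = src.createStatement();
                     ResultSet rs = q.executeQuery(
                         "SELECT id, name, value FROM source_table");
                     PreparedStatement ins = dst.prepareStatement(
                         "INSERT INTO local_table VALUES (?, ?, ?)")) {
                    while (rs.next()) {     // write each row as it arrives
                        ins.setInt(1, rs.getInt(1));
                        ins.setString(2, rs.getString(2));
                        ins.setDouble(3, rs.getDouble(3));
                        ins.executeUpdate();
                    }
                }
                dst.commit();
            }
        }
    }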

Answer 2: You really do need to have multiple connections at a time, and
you're willing to have a more complex system to support this approach.

Well, SQLite can be used in a thread-safe manner, and it has a
write-ahead log (WAL) mode (not on by default) which improves
concurrency: readers no longer block the writer, though writes
themselves still happen one at a time.
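
Enabling WAL is a one-liner, and the setting persists in the database
file so later connections inherit it.  A sketch via JDBC, assuming the
Xerial sqlite-jdbc driver and a placeholder file name:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class EnableWal {
        public static void main(String[] args) throws SQLException {
            try (Connection db = DriverManager.getConnection(
                     "jdbc:sqlite:snapshot.db");
                 Statement s = db.createStatement()) {
                s.execute("PRAGMA journal_mode=WAL;");
            }
        }
    }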

That doesn't sound so bad, but remember that you're going to have to
write code to manage issuing lots of non-overlapping subqueries, as
well as handling the various failures that could crop up and redoing
queries as needed.  And you almost certainly will not be able to guarantee
atomicity across all your simultaneous connections, so they could very
well give incoherent data due to real-time changes to the database.
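
To make "non-overlapping subqueries" concrete, one hypothetical scheme
partitions over an integer primary key; the retry logic and the
cross-connection consistency problem are still yours to solve:

    public class Subqueries {
        public static void main(String[] args) {
            int connections = 4, chunk = 100_000;  // arbitrary example values
            for (int i = 0; i < connections; i++) {
                // Each connection gets a disjoint key range.
                System.out.printf(
                    "SELECT id, name, value FROM source_table"
                    + " WHERE id >= %d AND id < %d%n",
                    i * chunk, (i + 1) * chunk);
            }
        }
    }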

If you still want to proceed, there's more to think about.

What you seem to describe does not actually require true threads because
when connected to the Internet (via a typical ISP), you'll only actually
be able to receive one thing at a time.  You mention MyJSQLView in your
signature, and I assume that is related to Java.  Back in 2002, Java
1.4 added support for NIO, which imitates the traditional and
surprisingly powerful Unix select() mechanism, thereby facilitating
I/O multiplexing within a single thread.  At the
time, it wasn't as advanced as Tcl's I/O event mechanism, but with some
work I was able to develop a Tcl-like framework on top of it.  Anyway,
if you were to use this, your process would resemble:

1. Allocate a predetermined number of connection state structures
2. If a connection is closed, open it
3. Use select() or NIO to wait for incoming data or other events
4. On connect succeed, submit the (sub-)query
5. On connection termination, mark the connection structure as closed
6. On writability, if query request data is waiting to be sent, send it
7. On readability, receive and parse data, and load into SQLite
8. Go back to 3 until all expected data is received
9. Terminate when all expected data is received
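
Here is a skeletal, single-threaded version of that loop.  The host,
port, and wire protocol are hypothetical; JDBC does not expose NIO
channels, so this assumes the data source speaks some raw socket
protocol, and reopening closed connections (step 2) is omitted:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;
    import java.nio.charset.StandardCharsets;
    import java.util.Iterator;

    public class MultiplexedFetch {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            for (int i = 0; i < 4; i++) {       // 1. fixed pool of connections
                SocketChannel ch = SocketChannel.open();
                ch.configureBlocking(false);
                ch.connect(new InetSocketAddress("db.example.com", 5000));
                ch.register(selector, SelectionKey.OP_CONNECT);
            }
            ByteBuffer buf = ByteBuffer.allocate(8192);
            while (!selector.keys().isEmpty()) {
                selector.select(500);           // 3. wait for events
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    SocketChannel ch = (SocketChannel) key.channel();
                    if (key.isConnectable() && ch.finishConnect()) {
                        // 4. connected: submit the (sub-)query
                        ch.write(ByteBuffer.wrap(
                            "SELECT ...".getBytes(StandardCharsets.UTF_8)));
                        key.interestOps(SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        buf.clear();
                        if (ch.read(buf) < 0) { // 5. peer closed the connection
                            key.cancel();
                            ch.close();
                        } else {
                            // 7. parse buf and INSERT rows into SQLite here
                        }
                    }
                }
            }                                   // 8/9. loop until all keys are gone
        }
    }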

As you can see, there is quite a bit of complexity, which is why I
prefer the Tcl way; it's easier to explain:

1 . Open connections asynchronously
2 . Enter event loop
3a. Handler for connect success sends request query;
3b. and registers readability handler
4a. Handler for readability tries to read;
4b. on success, it loads data into SQLite;
4c. on failure, it closes and restarts connection as in step 1
5 . Terminate when all expected data is received

My old NIO wrapper made this paradigm work in Java, though a native
approach may have made it into the official class library in the past
decade.  You could also look at Twisted, which provides a similar
event-driven concept in Python.

Tcl 8.6 coroutines make this code easier to write in a manner that
resembles multithreaded coding.  I used these techniques to develop the
Wibble web server, which does exactly this, except it handles the
server side and speaks HTTP.

> These import routines are exactly how I have had to populate a local
> SQLite database when testing analysis plugins. The process of setting
> up the database table(s), exporting data from the datasource, and
> importing it into the local file/memory database would be much simpler
> for non-expert users if automated, so they can focus on deriving
> results from analysis with a local, higher-performance file/memory
> database.

So to sum up, you want to repeatedly analyze one or more data sets which
just so happen to have been read from a remote database, but you don't
want to make that remote database do all the work because it could be at
the nether end of the Internet.  You want to do this analysis using your
existing codebase which was designed to operate using SQL.  You see
SQLite as a good fit because it's compatible (useful subset of SQL) and
is trivial to set up (link it into your program and you're done).  That
all seems totally reasonable to me.
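
And the setup really is tiny.  A sketch, again assuming the Xerial
sqlite-jdbc driver; file-backed versus in-memory is just a URL choice:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class LocalDb {
        public static void main(String[] args) throws SQLException {
            // In-memory: fastest, gone on exit.  File: persists between runs.
            try (Connection mem = DriverManager.getConnection(
                     "jdbc:sqlite::memory:");
                 Connection file = DriverManager.getConnection(
                     "jdbc:sqlite:local.db")) {
                // Hand either connection to the existing analysis code.
            }
        }
    }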

--
Andy Goth | <andrew.m.goth/at/gmail/dot/com>