Part of a web application I'm working on requires a large number of queries joining several tables, each containing millions of rows. Accessing the database (FrontBase) through either the JDBC drivers or the Rev drivers provided by FrontBase, single queries would take anywhere between 300-900ms. Since hundreds of such queries can be needed to produce one response, I had to look at ways to speed this up. I don't believe it is a question of writing better SQL: the queries have to spider across the data, keeping track of the route they take to reach each node in the graph.
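The kind of graph-spidering query described above can be sketched as a breadth-first traversal over an in-memory adjacency map that records the route taken to each node. This is only an illustration in Python (the actual Rev code isn't shown here), with made-up node names:

```python
from collections import deque

def paths_to_all_nodes(edges, start):
    """Spider outward from `start`, recording the route taken
    to reach every reachable node (hypothetical illustration)."""
    paths = {start: [start]}          # node -> route used to reach it
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in edges.get(node, ()):
            if neighbour not in paths:
                # Extend the route that reached `node` by one hop.
                paths[neighbour] = paths[node] + [neighbour]
                queue.append(neighbour)
    return paths

# Hypothetical sample graph standing in for the joined tables.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(paths_to_all_nodes(edges, "a"))
```

With the whole adjacency map in memory, each hop is a hash lookup rather than an SQL join, which is where the speed-up in the approach below comes from.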
I tuned the database, making sure that all the 'where' columns were indexed, and checked the query plan to ensure that precisely those indexes were being used. I gave the database 100 MB of RAM cache, and even tried loading the core tables 100% into RAM, but still could not get much better performance. From my testing, the data did not seem to be found in RAM until after the first time it was retrieved from disk, despite my setting the tables to pre-load into RAM. Perhaps over time the RAM-cached data would have performed much better than it did at this stage.

I then decided to try HSQLDB, an RDBMS written in Java which can hold all its tables in RAM, so I could be sure the data was in RAM. After exporting the core tables to CSV format (and transforming them into SQL inserts using Rev), the database took about an hour to start and, once started, occupied 1.5 GB of RAM. But the queries were still slow (my guess is that HSQLDB is simply not designed to take that much data).

So I took the CSV data and imported it into a stack, turning each table into a custom property set. I could now search the data in a tiny fraction of a second -- literally. The hundreds of queries involved in making a single response complete in 200ms. That's right: I can now run the entire series of queries in less time than it took to do a single query against an RDBMS. Obviously, custom properties used in this way are not appropriate if one needs referential integrity, locking, concurrent access, etc. But for read-only queries, Rev has provided a very satisfactory solution. Moreover, whereas HSQLDB took an hour and 1.5 GB of RAM, Rev can load the 100 MB of CSV data in 2-4 seconds, making frequent updates of the data and subsequent restarts of the application feasible.
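The technique above -- loading each CSV table once into memory, indexed by the column you query on -- can be sketched in Python (the original uses Rev custom property sets; the table and column names here are hypothetical):

```python
import csv
from collections import defaultdict
from io import StringIO

def load_index(csv_text, key_column):
    """Read a CSV table once and index its rows by one column,
    mimicking a table loaded into a custom property set."""
    index = defaultdict(list)
    for row in csv.DictReader(StringIO(csv_text)):
        index[row[key_column]].append(row)
    return index

# Hypothetical sample data standing in for an exported table.
nodes_csv = "id,name\n1,alpha\n2,beta\n"
nodes_by_id = load_index(nodes_csv, "id")
print(nodes_by_id["2"][0]["name"])   # each query is now a dict lookup
```

The one-time cost of building the index is paid at startup (the 2-4 second load mentioned above); after that, every query is an in-memory hash access with no driver round trip.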
After following Pierre Sahores' tutorial on creating persistent Rev CGI processes, I now have these queries returning a web page in 200ms, instead of the 30 seconds they would have taken using Java and a database. I know all of this flies in the face of many people's expectations, and it is certainly not the position I saw myself in two weeks ago. The rest of the application is written in Java, but this core feature has been made possible by Rev. I expect that associative arrays in e.g. PHP might be just as fast in this regard.

Bernard

_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
