Re: [sqlite] speeding up row by row lookup in a large db

2009-03-21 Thread P Kishor
Found the bottleneck! First, the results with the "correction" --

Db remains on disk:
-
[06:01 PM] ~/Data/carbonmodel$perl carbonmodel.pl
Prepare load testing
timethis 1000:  3 wallclock secs ( 0.23 usr +  0.41 sys =  0.64 CPU) @
1562.50/s (n=1000)

or 0.64 ms per SELECT

Db is transferred into RAM:
---
[05:51 PM] ~/Data/carbonmodel$perl carbonmodel.pl
Creating in memory tables... done.
Transferring data to memory... done. Took: 90 wallclock secs (76.05
usr +  8.33 sys = 84.38 CPU)
Creating indexes... done. Took: 36 wallclock secs (23.98 usr + 11.29
sys = 35.27 CPU)
Prepare load testing
timethis 1000:  0 wallclock secs ( 0.07 usr +  0.00 sys =  0.07 CPU) @
14285.71/s (n=1000)
(warning: too few iterations for a reliable count)

or 0.07 ms per SELECT.


Ok, so what is going on? Well, the most excellent Devel::NYTProf (if
you haven't seen it in action, you have no idea what you are missing)
showed that retrieving the weather data was chewing up all the time.
See

http://carbonmodel.org/nytprof/carbonmodel.pl-sub.html#901

Taking that out causes the SELECT rate to jump up.

Of course, that is not a solution, because I do need the weather data,
but there is a lot of it... basically, each cell has 7300 rows by 9
columns of data (all INTEGER and REAL).

The query I have for retrieving the weather data is

SELECT year, y_day, t_max, t_min, t_ave, precip, vpd, s_rad, day_len
FROM temporal
WHERE met_id IN (:list_of_met_ids)

There is an index on met_id, but whether I have

WHERE met_id IN (3)

or

WHERE met_id = 3

I get the same result... about 28-30 ms to retrieve all 20 years'
worth of daily weather data. So, now I can focus my energy on
streamlining that particular part of the problem. Maybe I simply have
to rethink my table. As it is, the weather table looks like this

table: temporal
--
temporal_id INTEGER PRIMARY KEY
bunch of other columns
met_id INTEGER

There are 400 sets of 7300 rows (hence, 400 unique met_ids in 2.92
million rows). Each cell is related to one and only one met_id, while
each met_id, of course, can refer to many cells. Once I retrieve a
cell, I know its met_id from the cells table. Then I can look up in
the temporal table for the 7300 rows that belong to that cell.
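
(For reference, the retrieval boils down to something like the
following -- a simplified sketch, not the actual carbonmodel.pl; the
db file name and variable names here are just illustrative:)

  use DBI;

  my $dbh = DBI->connect('dbi:SQLite:dbname=carbonmodel.db', '', '',
      { RaiseError => 1 });

  # one prepared statement, reused for every cell
  my $sth = $dbh->prepare(q{
      SELECT year, y_day, t_max, t_min, t_ave, precip, vpd, s_rad, day_len
      FROM   temporal
      WHERE  met_id = ?
  });

  # in practice the met_id comes from the cell just retrieved
  my $met_id = 3;
  $sth->execute($met_id);

  # ~7300 rows x 9 columns come back as an array of array refs
  my $weather = $sth->fetchall_arrayref;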

Of course, I could have 400 different temporal tables, but I would
still need to retrieve those. Might be worth a shot to see if that
helps, but, oh, how messy.

But, read on for specific responses to Stefan's post...


On Sat, Mar 21, 2009 at 5:42 PM, Stefan Evert  wrote:
>
> On 21 Mar 2009, at 15:31, P Kishor wrote:
>
>> I did some benchmarking with the above schema using Perl DBI, and I
>> get about 30 transactions per second as long as I am returning the
>> data to memory.
>
> Even for Perl/DBI, that seems pretty slow.  Depends on how much data each of
> these transactions returns, though -- if there are thousands of rows in lc
> or dist for each cell_id, then you can't expect much better performance.
>  Even though DBI and DBD::SQLite are written in C, they have to allocate
> fairly complex data structures to return the data (in the best case, an
> anonymous array with 40 to 60 entries for each data row in the result set),
> and these operations are relatively expensive in Perl (I know because I've
> written some XS code recently that does this kind of thing).
>

Stefan, you were right about the "Depends on how much data" part, but
it applied to the weather data, not to the lookup tables for lc (or
landcover) or dist (or disturbance).

Also, "Even for Perl/DBI" makes it seem there is something wrong with
Perl/DBI... no, no... Perl/DBI is an industry-tested, extremely
capable interface. I wouldn't trade it for anything. The problem was
with my problem, with my data size, with my approach. I have to
correct that. DBD::SQLite/DBD::SQLite::Amalgamation/Perl DBI are the
greatest things since SQLite.

> Another thing to keep in mind is that the SQLite version included in the
> DBD::SQLite distribution is fairly old (3.4.0 on my Mac -- I doubt there's a
> more recent version of DBD::SQLite around), and AFAIK there have been some
> speed improvements in SQLite recently.

I am using DBD::SQLite::Amalgamation with SQLite 3.6.11. I am not sure
there is anything wrong with DBD::SQLite 1.1.4 at all (other than its
use of the older SQLite code, which is easily corrected). One might
get better bang by ensuring that the latest version of DBI is
installed, which Tim Bunce and company are probably always
fine-tuning.
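
(As a sanity check, it is easy to ask the driver which SQLite library
it is actually linked against -- shown here with the stock DBD::SQLite
DSN; the Amalgamation driver answers the same query:)

  use DBI;

  my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
      { RaiseError => 1 });

  # the linked SQLite library reports its own version,
  # regardless of which driver wrapper is in use
  my ($sqlite) = $dbh->selectrow_array('SELECT sqlite_version()');
  print "SQLite library: $sqlite (DBI $DBI::VERSION)\n";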



>
> (Darren, any news from the maintainer of DBD::SQLite?  I would be very
> delighted and grateful to be able to use an up-to-date SQLite version in my
> Perl scripts.)
>
>>
>> [1] First retrieve all data from cell table
>> SELECT * FROM cell WHERE cell_id = :cell_id
>>
>> [2] Now retrieve the related lc, dist and met
>> SELECT lc.*
>> FROM lc l JOIN cell_lc c on l.lc_id = c.lc_id
>> WHERE c.cell_id = :cell_id
>
> Just a short 

Re: [sqlite] speeding up row by row lookup in a large db

2009-03-21 Thread Stefan Evert

On 21 Mar 2009, at 15:31, P Kishor wrote:

> I did some benchmarking with the above schema using Perl DBI, and I
> get about 30 transactions per second as long as I am returning the
> data to memory.

Even for Perl/DBI, that seems pretty slow.  Depends on how much data  
each of these transactions returns, though -- if there are thousands  
of rows in lc or dist for each cell_id, then you can't expect much  
better performance.  Even though DBI and DBD::SQLite are written in C,  
they have to allocate fairly complex data structures to return the  
data (in the best case, an anonymous array with 40 to 60 entries for  
each data row in the result set), and these operations are relatively  
expensive in Perl (I know because I've written some XS code recently  
that does this kind of thing).

Another thing to keep in mind is that the SQLite version included in  
the DBD::SQLite distribution is fairly old (3.4.0 on my Mac -- I doubt  
there's a more recent version of DBD::SQLite around), and AFAIK there  
have been some speed improvements in SQLite recently.

(Darren, any news from the maintainer of DBD::SQLite?  I would be very  
delighted and grateful to be able to use an up-to-date SQLite version  
in my Perl scripts.)

>
> [1] First retrieve all data from cell table
> SELECT * FROM cell WHERE cell_id = :cell_id
>
> [2] Now retrieve the related lc, dist and met
> SELECT lc.*
> FROM lc l JOIN cell_lc c on l.lc_id = c.lc_id
> WHERE c.cell_id = :cell_id

Just a shot in the dark, but people on this list have occasionally
pointed out that SQLite isn't very good at optimising JOIN queries, so
it's often better to use a sub-select clause.  (SQLite isn't very good
at optimisation in general, at least in the 3.4.0 version that comes
with the Perl driver, and I've found that sometimes it helps a lot to
give a few subtle hints to the optimiser ...)

Have you tried rephrasing queries [2] and [3] in this form?

> SELECT * FROM lc WHERE lc_id IN (SELECT lc_id FROM cell_lc WHERE  
> cell_id = :cell_id)
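
(and, presumably, query [3] rewritten the same way -- assuming the
cell_dist link table is the one intended there -- would be something
like)

> SELECT * FROM dist WHERE dist_id IN (SELECT dist_id FROM cell_dist
> WHERE cell_id = :cell_id)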

I've had good experiences with this approach, although my definition  
of good performance is rather along the lines of "completes within  
less than 5 seconds". :-)


Best regards,
Stefan Evert

[ stefan.ev...@uos.de | http://purl.org/stefan.evert ]





Re: [sqlite] speeding up row by row lookup in a large db

2009-03-21 Thread Jim Wilcoxson
The reason you're getting the same results is that you are CPU-bound.
I/O has nothing to do with this problem.  From the timings of your
app, 31.76/33 = 96% CPU.  If you were I/O-bound, your real time would
be 33 seconds and your sys+user time would be 3 seconds, or something
similarly low.

My guess is you are spending more time looping around in Perl than you
think.  Try running your Perl code with one small, static set of test
data in a loop and see how much time it takes w/o any DB interactions.
 That will give you a baseline for performance improvements.  If it
turns out that 1000 loops w/o SQL takes 25 seconds instead of 33, none
of your SQL optimizations matter much.
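
Something along these lines would give that baseline (a sketch -- the
per-row work here is only a stand-in for whatever carbonmodel.pl
really does with each row):

  use Benchmark qw(timethis);

  # one static, pre-fetched set of fake rows -- no DB handle involved
  my @rows = map { [ (0) x 9 ] } 1 .. 7300;

  timethis(1000, sub {
      my $total = 0;
      $total += $_->[2] for @rows;   # stand-in for the real per-row work
  });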

Jim


On 3/21/09, P Kishor  wrote:

> So, I increased the cache_size to 1048576 but got the same results...
> 30 odd SELECTs per second.
>
> Then I created an in-memory db and copied all the data from the
> on-disk db to memory. I didn't use the backup API... simply opened a
> db connection to an in-memory db, then created all the tables and
> indexes, ATTACHed the on-disk db and did an INSERT .. SELECT * FROM
> attached db. Interestingly, the same results --
>
> [04:24 PM] ~/Data/carbonmodel$perl carbonmodel.pl
> Creating in memory tables... done.
> Transferring data to memory... done. Took: 90 wallclock secs (75.88
> usr +  8.44 sys = 84.32 CPU)
> Creating indexes... done. Took: 38 wallclock secs (23.82 usr + 13.36
> sys = 37.18 CPU)
> Prepare load testing
> ...timethis 1000: 33 wallclock secs (30.74 usr +  1.02 sys = 31.76
> CPU) @ 31.49/s (n=1000)
-- 
Software first.  Software lasts!


Re: [sqlite] speeding up row by row lookup in a large db

2009-03-21 Thread Kees Nuyt
On Sat, 21 Mar 2009 16:37:15 -0500, P Kishor
 wrote:

> [04:24 PM] ~/Data/carbonmodel$perl carbonmodel.pl
> Creating in memory tables... done.
> Transferring data to memory... done. Took: 90 wallclock secs (75.88
> usr +  8.44 sys = 84.32 CPU)
> Creating indexes... done. Took: 38 wallclock secs (23.82 usr + 13.36
> sys = 37.18 CPU)
> Prepare load testing
> ...timethis 1000: 33 wallclock secs (30.74 usr +  1.02 sys = 31.76
> CPU) @ 31.49/s (n=1000)

Loading and indexing is pretty fast.


>So, I increased the cache_size to 1048576 but got the same results...
>30 odd SELECTs per second.

With a cache of 1M pages and a page_size of 32 kByte, your
cache would amount to 32 GByte; that's not realistic on a
32-bit OS.

Maybe you should try PRAGMA [default_]cache_size=50000 ?
That's 1.5 GByte, and would leave some headroom for OS
disk buffers.
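
(From Perl/DBI that is just a one-liner -- the 50000 figure is only a
starting point to experiment with, and the db name is illustrative:)

  use DBI;

  my $dbh = DBI->connect('dbi:SQLite:dbname=carbonmodel.db', '', '',
      { RaiseError => 1 });

  # per-connection setting, not persistent across connections
  $dbh->do('PRAGMA cache_size = 50000');

  # or make it the default stored persistently in the database file
  $dbh->do('PRAGMA default_cache_size = 50000');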

> I will try out with PostGres and report back on what I get.

I'm curious.
-- 
  (  Kees Nuyt
  )
c[_]


Re: [sqlite] speeding up row by row lookup in a large db

2009-03-21 Thread P Kishor
On Sat, Mar 21, 2009 at 2:07 PM, Nicolas Williams
 wrote:
> On Sat, Mar 21, 2009 at 01:55:32PM -0500, P Kishor wrote:
>> On Sat, Mar 21, 2009 at 1:25 PM, Nicolas Williams
>>  wrote:
>> > On Sat, Mar 21, 2009 at 12:14:43PM -0500, P Kishor wrote:
>> >> If I can't improve 33 ms per query, then I can experiment with
>> >
>> > 33ms per-query sounds like you're not caching enough of the database in
>> > memory.  What's the cache size?  Can you jack it up?
>>
>> hmmm... never thought of the cache size. From the docs...
>>
>> "PRAGMA default_cache_size = Number-of-pages;
>> [...]
>> So, any suggestions on what cache size I should experiment with? And,
>> does this have to be set *before* the db is created? From the above
>> description it sounds like I can set the cache_size at any time.
>
> Well, take the number of pages in your DB (for a vacuumed DB: file size
> / page size) and see if setting cache_size == that improves things.
>
> (default_cache_size is stored in the DB so that connections that don't
> specify a cache_size get what default_cache_size says.)
>
>> > Does the entire
>> > dataset fit in memory?  If so, why isn't it all in memory?  Or if it is
>> > all in memory, what's SQLite3 spending its time on?
>>
>> How do I take a db on disk and load it all in memory? How is that
>> done? I have never done that before. Seems like I can create a db in
>> memory with
>
> "The online-backup interface can be used to copy content from a disk
> file into an in-memory database or vice versa and it can make a hot
> backup of a live database."
>
> http://sqlite.org/backup.html
>
> But the thing is, you might just set the cache size large enough and let
> it warm up as you go -- the effect should be the same if your DB doesn't
> grow very fast.
>
>>                                                           Also, isn't
>> there a 2 GB limit to the amount of RAM that 32-bit processes can
>> address?
>
> Even so, 1GB of cache is much better than the 2000 page (x 1KB page
> size == 2MB) default.
>
> Also, you might consider going 64-bit.
>
>> By the way, even though I have a lot of computing horsepower, I would
>> like to work toward a solution that would work reasonably well even
>> without access to a cluster. While no one would expect lightning fast
>> responses for model runs over millions of cells, it would be nice to
>> cut the time from several hours down to sub-hour levels. But that is a
>> longer road to tread.
>
> Memory is the key for a large DB using SQLite3.
>
> If you're building a distributed application, then SQLite3 is probably
> the wrong tool to use (though you could use SQLite3 with a local copy
> of a DB if replication is easy because your dataset is mostly
> read-only, say).
>


So, I increased the cache_size to 1048576 but got the same results...
30 odd SELECTs per second.

Then I created an in-memory db and copied all the data from the
on-disk db to memory. I didn't use the backup API... simply opened a
db connection to an in-memory db, then created all the tables and
indexes, ATTACHed the on-disk db and did an INSERT .. SELECT * FROM
attached db. Interestingly, the same results --

[04:24 PM] ~/Data/carbonmodel$perl carbonmodel.pl
Creating in memory tables... done.
Transferring data to memory... done. Took: 90 wallclock secs (75.88
usr +  8.44 sys = 84.32 CPU)
Creating indexes... done. Took: 38 wallclock secs (23.82 usr + 13.36
sys = 37.18 CPU)
Prepare load testing
...timethis 1000: 33 wallclock secs (30.74 usr +  1.02 sys = 31.76
CPU) @ 31.49/s (n=1000)

Maybe that's what it is then... with the amount of data I have per
SELECT, it is the same result whether the db is on disk or in memory.
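
(For the record, the copy-to-memory step described above is
essentially the following -- a sketch; the real script creates the
tables first and builds the indexes after the transfer, and the file
name and table list here just mirror the schema posted earlier:)

  use DBI;

  my $mem = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
      { RaiseError => 1 });

  # ... CREATE TABLE statements for every table go here ...

  $mem->do(q{ATTACH DATABASE 'carbonmodel.db' AS disk});

  for my $t (qw(cell lc dist met cell_lc cell_dist)) {
      $mem->do("INSERT INTO $t SELECT * FROM disk.$t");
  }

  $mem->do('DETACH DATABASE disk');

  # ... CREATE INDEX statements go here, after the data is in ...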

I will try out with PostGres and report back on what I get.

-- 
Puneet Kishor http://www.punkish.org/
Nelson Institute for Environmental Studies http://www.nelson.wisc.edu/
Carbon Model http://carbonmodel.org/
Open Source Geospatial Foundation http://www.osgeo.org/
Sent from: Madison WI United States.


Re: [sqlite] speeding up row by row lookup in a large db

2009-03-21 Thread Kees Nuyt
On Sat, 21 Mar 2009 14:07:17 -0500, Nicolas Williams
 wrote:

>But the thing is, you might just set the cache size large enough and let
>it warm up as you go -- the effect should be the same if your DB doesn't
>grow very fast.
>
>>  [Puneet:] Also, isn't there a 2 GB limit to the amount of RAM 
>>  that 32-bit processes can address?
>
>Even so, 1GB of cache is much better than the 2000 page (x 1KB page
>size == 2MB) default.

I second this.

By the way, the SQLite team is changing the cache purging
strategy between 3.6.11 and 3.6.12 (e.g. checkin 6341
http://www.sqlite.org/cvstrac/chngview?cn=6341 ).

The effect would be that often used pages (like non-leaf
index pages) are retained, which could reduce I/O.

>Also, you might consider going 64-bit.
-- 
  (  Kees Nuyt
  )
c[_]


Re: [sqlite] speeding up row by row lookup in a large db

2009-03-21 Thread Nicolas Williams
On Sat, Mar 21, 2009 at 01:55:32PM -0500, P Kishor wrote:
> On Sat, Mar 21, 2009 at 1:25 PM, Nicolas Williams
>  wrote:
> > On Sat, Mar 21, 2009 at 12:14:43PM -0500, P Kishor wrote:
> >> If I can't improve 33 ms per query, then I can experiment with
> >
> > 33ms per-query sounds like you're not caching enough of the database in
> > memory.  What's the cache size?  Can you jack it up?
> 
> hmmm... never thought of the cache size. From the docs...
> 
> "PRAGMA default_cache_size = Number-of-pages;
> [...]
> So, any suggestions on what cache size I should experiment with? And,
> does this have to be set *before* the db is created? From the above
> description it sounds like I can set the cache_size at any time.

Well, take the number of pages in your DB (for a vacuumed DB: file size
/ page size) and see if setting cache_size == that improves things.

(default_cache_size is stored in the DB so that connections that don't
specify a cache_size get what default_cache_size says.)

> > Does the entire
> > dataset fit in memory?  If so, why isn't it all in memory?  Or if it is
> > all in memory, what's SQLite3 spending its time on?
> 
> How do I take a db on disk and load it all in memory? How is that
> done? I have never done that before. Seems like I can create a db in
> memory with

"The online-backup interface can be used to copy content from a disk
file into an in-memory database or vice versa and it can make a hot
backup of a live database."

http://sqlite.org/backup.html
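
(In Perl, a DBD::SQLite new enough to expose the backup API should let
you do the disk-to-memory copy in one call -- the method below exists
only in recent driver releases, so treat that as an assumption about
your build; the file name is illustrative:)

  use DBI;

  my $mem = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
      { RaiseError => 1 });

  # copies the on-disk database wholesale into the connected :memory: db
  $mem->sqlite_backup_from_file('carbonmodel.db');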

But the thing is, you might just set the cache size large enough and let
it warm up as you go -- the effect should be the same if your DB doesn't
grow very fast.

>   Also, isn't
> there a 2 GB limit to the amount of RAM that 32-bit processes can
> address?

Even so, 1GB of cache is much better than the 2000 page (x 1KB page
size == 2MB) default.

Also, you might consider going 64-bit.

> By the way, even though I have a lot of computing horsepower, I would
> like to work toward a solution that would work reasonably well even
> without access to a cluster. While no one would expect lightning fast
> responses for model runs over millions of cells, it would be nice to
> cut the time from several hours down to sub-hour levels. But that is a
> longer road to tread.

Memory is the key for a large DB using SQLite3.

If you're building a distributed application, then SQLite3 is probably
the wrong tool to use (though you could use SQLite3 with a local copy
of a DB if replication is easy because your dataset is mostly
read-only, say).

Nico
-- 


Re: [sqlite] speeding up row by row lookup in a large db

2009-03-21 Thread P Kishor
On Sat, Mar 21, 2009 at 1:25 PM, Nicolas Williams
 wrote:
> On Sat, Mar 21, 2009 at 12:14:43PM -0500, P Kishor wrote:
>> If I can't improve 33 ms per query, then I can experiment with
>
> 33ms per-query sounds like you're not caching enough of the database in
> memory.  What's the cache size?  Can you jack it up?

hmmm... never thought of the cache size. From the docs...

"PRAGMA default_cache_size = Number-of-pages;

Query or change the maximum number of database disk pages that SQLite
will hold in memory at once. Each page uses 1K on disk and about 1.5K
in memory. This pragma works like the cache_size pragma with the
additional feature that it changes the cache size persistently. With
this pragma, you can set the cache size once and that setting is
retained and reused every time you reopen the database."

So, any suggestions on what cache size I should experiment with? And,
does this have to be set *before* the db is created? From the above
description it sounds like I can set the cache_size at any time.

> Does the entire
> dataset fit in memory?  If so, why isn't it all in memory?  Or if it is
> all in memory, what's SQLite3 spending its time on?

How do I take a db on disk and load it all in memory? How is that
done? I have never done that before. Seems like I can create a db in
memory with

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:');

But how do I take a db on disk and load it all in memory? Also, isn't
there a 2 GB limit to the amount of RAM that 32-bit processes can
address?

By the way, even though I have a lot of computing horsepower, I would
like to work toward a solution that would work reasonably well even
without access to a cluster. While no one would expect lightning fast
responses for model runs over millions of cells, it would be nice to
cut the time from several hours down to sub-hour levels. But that is a
longer road to tread.


>
> Can you profile your application?  You could use DTrace if on Solaris,
> FreeBSD or MacOS X (actually, there's an early port of DTrace to Linux
> too that I hear is usable).
>

I will do that eventually. For now, I don't really have an
application, just the very beginnings of a db. What you see in this
thread are results from my profiling, albeit at a very coarse level
using Benchmark Perl module.


> Nico
> --
>



-- 
Puneet Kishor http://www.punkish.org/
Nelson Institute for Environmental Studies http://www.nelson.wisc.edu/
Carbon Model http://carbonmodel.org/
Open Source Geospatial Foundation http://www.osgeo.org/
Sent from: Madison WI United States.


Re: [sqlite] speeding up row by row lookup in a large db

2009-03-21 Thread Nicolas Williams
On Sat, Mar 21, 2009 at 12:14:43PM -0500, P Kishor wrote:
> If I can't improve 33 ms per query, then I can experiment with

33ms per-query sounds like you're not caching enough of the database in
memory.  What's the cache size?  Can you jack it up?  Does the entire
dataset fit in memory?  If so, why isn't it all in memory?  Or if it is
all in memory, what's SQLite3 spending its time on?

Can you profile your application?  You could use DTrace if on Solaris,
FreeBSD or MacOS X (actually, there's an early port of DTrace to Linux
too that I hear is usable).

Nico
-- 


Re: [sqlite] speeding up row by row lookup in a large db

2009-03-21 Thread Jim Wilcoxson
I'm not sure I completely understand your data structure, but here are
some ideas:

First, conduct experiments with different page and cache sizes.  I
don't know if you jumped to a 32K page size or experimented with each
size, but experimentation is the way to go.

I'm guessing that in your cell_lc and cell_dist tables, rows are
unique on cell_id.  If so, you could make cell_id the primary key and
avoid the indexes altogether.
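
If that is the case, the link tables could simply be declared like
this (only valid if each cell_id really does appear at most once per
table):

  CREATE TABLE cell_lc   (cell_id INTEGER PRIMARY KEY, lc_id   INTEGER);
  CREATE TABLE cell_dist (cell_id INTEGER PRIMARY KEY, dist_id INTEGER);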

When you posted before, you mentioned that you had multi-TB hard
drives, tons of memory, quad CPUs, etc., and didn't care about saving
space.  I'd denormalize the data, putting lc and dist fields in the
cell table.  That will eliminate join #2 and join #3.

You don't really say what you're doing with the results of these
queries below.  Are you doing a bunch of Perlish stuff with the
results?  It's possible that most of your time is being spent in the
Perl interpreter rather than in SQLite.  To check this, try sending
the output to /dev/null and using the sqlite3 command line tool to
execute your queries.  If they happen quickly, then the overhead is in
the manipulation of the results, not the queries.

Good luck!
Jim


On 3/21/09, P Kishor  wrote:
> Part 1.
> ---
>
> I have the following schema in a SQLite db that is 430 MB on my
> Macbook laptop's 320 GB HFS+ formatted 7200 RPM disk with an 8 MB
> cache.
>
> -- 1,000,000 rows
> CREATE TABLE cell (
>   cell_id INTEGER PRIMARY KEY,
>   met_cell_id INTEGER,
>   8 other INTEGER or REAL columns
> )
>
> -- 38 rows
> CREATE TABLE lc (
>   lc_id INTEGER PRIMARY KEY,
>   56 other INTEGER or REAL columns
> )
>
> -- 10 rows
> CREATE TABLE dist (
>   dist_id INTEGER PRIMARY KEY,
>   37 other INTEGER or REAL columns
> )
>
> -- 2,920,000
> CREATE TABLE met (
>   met_id INTEGER PRIMARY KEY,
>   met_cell_id INTEGER,
>   9 other INTEGER or REAL columns
> )
>
> CREATE TABLE cell_lc (cell_id INTEGER, lc_id INTEGER)
> CREATE TABLE cell_dist (cell_id INTEGER, dist_id INTEGER)
>
> CREATE INDEX idx_met_cell_id ON met (met_cell_id)
> CREATE INDEX idx_cell_lc ON cell_lc (cell_id)
> CREATE INDEX idx_cell_dist ON cell_dist (cell_id)
>
> I also have an R*Tree index, but that is a different story, not relevant
> here.
>
> I retrieve *all* data for one cell ':cell_id' using the following queries
>
> [1] First retrieve all data from cell table
> SELECT * FROM cell WHERE cell_id = :cell_id
>
> [2] Now retrieve the related lc, dist and met
> SELECT lc.*
> FROM lc l JOIN cell_lc c on l.lc_id = c.lc_id
> WHERE c.cell_id = :cell_id
>
> [3] Retrieve the related dist
> SELECT d.*
> FROM dist d JOIN cell_dist c on d.dist_id = c.dist_id
> WHERE c.cell_id = :cell_id
>
> [4] Retrieve the related met
> SELECT * FROM met WHERE met_cell_id = 
>
> I did some benchmarking with the above schema using Perl DBI, and I
> get about 30 transactions per second as long as I am returning the
> data to memory.
>
> [08:38 AM] ~/Data/carbonmodel$perl carbonmodel.pl
>
> timethis 1:  0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU) @
> 33.33/s (n=1)
>
> timethis 10:  0 wallclock secs ( 0.31 usr +  0.02 sys =  0.33 CPU) @
> 30.30/s (n=10)
>
> timethis 100:  3 wallclock secs ( 2.85 usr +  0.20 sys =  3.05 CPU) @
> 32.79/s (n=100)
>
> timethis 1000: 33 wallclock secs (31.08 usr +  1.22 sys = 32.30 CPU) @
> 30.96/s (n=1000)
>
> if I write the data to file, the speed drops to about 1 transaction per
> second
>
> timethis 1000: 783 wallclock secs (732.26 usr + 18.22 sys = 750.48
> CPU) @  1.33/s (n=1000)
>
> Even if I stick with manipulating the data in memory, at 30
> transactions per second (or 33 ms per transaction), it would take more
> than 9 hours to query each of the 1 million cells one by one.
>
> In the real world, I will first find the relevant cell ids based on
> lat-lon bounds (hence my R*Tree index) and then extract their data one
> by one.
>
> How can I, if at all, speed this up?
>
> Part 2.
> ---
>
> Alternatively, I could denormalize the data completely. Inspired by a
> post on the Flickr blog
> (http://code.flickr.com/blog/2009/03/18/building-fast-client-side-searches/),
> in particular the para "To make this data available quickly from the
> server, we maintain and update a per-member cache in our database,
> where we store each member’s contact list in a text blob — this way
> it’s a single quick DB query to retrieve it. We can format this blob
> in any way we want: XML, JSON, etc" I decided to experiment with the
> same technique. So...
>
> CREATE TABLE cell_blobs (cell_id INTEGER PRIMARY KEY, cell_data BLOB);
>
> I then queried each cell as in Part 1, serialized it and stored it in
> the cell_blobs table. My intent is to simply retrieve a BLOB and
> deserialize it... it would *possibly* be quicker than 33 ms per
> retrieval. Well, I haven't yet completed this test because each BLOB
> is taking about 430 KB. At 1 million rows, that is going to occupy
> upward of 400 GB. I broke the load_blob_table routine after about a
> third of the records had 

Re: [sqlite] speeding up row by row lookup in a large db

2009-03-21 Thread P Kishor
On Sat, Mar 21, 2009 at 12:03 PM, Kees Nuyt  wrote:
>
> Hi Puneet,
>
> On Sat, 21 Mar 2009 10:47:44 -0500, P Kishor
>  wrote:
>
>>I should have mentioned the page_size in my OP. It is 32768 set by me
>>at the start of db creation.
>>
>>Yes, the db connection is kept open.
>
> Hm, apart from faster disks (15k RPM or high end SSD) I have
> no idea what else can be done to improve performance.


Ok. That is a good "limit" to know. One scenario I would like to
experiment with is doing full-fledged BLOB testing. For that, I need
to be able to convert the entire dataset into the BLOBs table. As I
mentioned in my original post, the BLOBs are taking about 430 KB per
cell, which is going to result in a 400+ GB db (a problem, given my
laptop has only 170 GB free... but I could do the testing on one of my
servers). The real problem is that the conversion is taking way too
long. I set the process going last night on my laptop, and this
morning only about 1/4 of the table had converted. The conversion will
be a one-time job, so it is not a big deal if it takes 4 days to do
this, but still... it would be nice to speed it up.

Unfortunately, I can't use 15K RPM disks, but I must add here that in
the "real world" this db will reside on a 3 GHz dual quad-core Xeon
Xserve with 32 GB RAM and a 2.73 TB hardware-RAIDed "disk" (3 x 1 TB
7200 RPM disks that Apple supplies).

If I can't improve 33 ms per query, then I can experiment with
chopping up the task, which will be done via Xgrid anyway. The model
will run on a 10-Xserve cluster, so I can create 10 copies of the db.
This will allow each instance of the model run to query its own copy
of the db, potentially divvying up the 9 hours of querying into 0.9
hours of querying.

Of course, an entirely different route is to do the testing and
development with SQLite, and then move to PostGres for production.
That way we can have only one copy of the db, and query it from the 10
cluster servers over IP.



-- 
Puneet Kishor http://www.punkish.org/
Nelson Institute for Environmental Studies http://www.nelson.wisc.edu/
Carbon Model http://carbonmodel.org/
Open Source Geospatial Foundation http://www.osgeo.org/
Sent from: Madison WI United States.


Re: [sqlite] speeding up row by row lookup in a large db

2009-03-21 Thread Kees Nuyt

Hi Puneet,

On Sat, 21 Mar 2009 10:47:44 -0500, P Kishor
 wrote:

>I should have mentioned the page_size in my OP. It is 32768 set by me
>at the start of db creation.
>
>Yes, the db connection is kept open.

Hm, apart from faster disks (15k RPM or high end SSD) I have
no idea what else can be done to improve performance.
-- 
  (  Kees Nuyt
  )
c[_]


Re: [sqlite] speeding up row by row lookup in a large db

2009-03-21 Thread P Kishor
On Sat, Mar 21, 2009 at 10:28 AM, Kees Nuyt  wrote:
>
> Hi Puneet,
>
> On Sat, 21 Mar 2009 09:31:45 -0500, P Kishor
>  wrote:
>
>>Part 1.
>>---
>>
>>I have the following schema in a SQLite db that is 430 MB on my
>>Macbook laptop's 320 GB HFS+ formatted 7200 RPM disk with an 8 MB
>>cache.
>
> [...]
>
>>How can I, if at all, speed this up?
>
> What is your page_size?
> Does it match your platform's optimum I/O size?
> Is the database connection kept open, so the cache remains
> valid?

I should have mentioned the page_size in my OP. It is 32768 set by me
at the start of db creation.

Yes, the db connection is kept open.


>
>>Part 2.
>>---
>
> [...]
>> Well, I haven't yet completed this test because
>>each BLOB is taking about 430 KB.
>
> [...]
>
>> I broke the load_blob_table routine after about a
>>third of the records had been processed because I found even the
>>loading_the_blobs to be excruciatingly slow.
>>
>>Suggestions?
>
> Especially BLOBs will benefit from a large page_size, I
> think. For this schema and estimated BLOB size I would start
> with the maximum page_size allowed, that is 32768 bytes.
> --
>  (  Kees Nuyt
>  )
> c[_]
>



-- 
Puneet Kishor http://www.punkish.org/
Nelson Institute for Environmental Studies http://www.nelson.wisc.edu/
Carbon Model http://carbonmodel.org/
Open Source Geospatial Foundation http://www.osgeo.org/
Sent from: Madison WI United States.


Re: [sqlite] speeding up row by row lookup in a large db

2009-03-21 Thread Kees Nuyt

Hi Puneet,

On Sat, 21 Mar 2009 09:31:45 -0500, P Kishor
 wrote:

>Part 1.
>---
>
>I have the following schema in a SQLite db that is 430 MB on my
>Macbook laptop's 320 GB HFS+ formatted 7200 RPM disk with an 8 MB
>cache.

[...]

>How can I, if at all, speed this up?

What is your page_size?
Does it match your platform's optimum I/O size?
Is the database connection kept open, so the cache remains
valid?

>Part 2.
>---

[...]
> Well, I haven't yet completed this test because 
>each BLOB is taking about 430 KB.

[...]

> I broke the load_blob_table routine after about a
>third of the records had been processed because I found even the
>loading_the_blobs to be excruciatingly slow.
>
>Suggestions? 

Especially BLOBs will benefit from a large page_size, I
think. For this schema and estimated BLOB size I would start
with the maximum page_size allowed, that is 32768 bytes.
-- 
  (  Kees Nuyt
  )
c[_]


[sqlite] speeding up row by row lookup in a large db

2009-03-21 Thread P Kishor
Part 1.
---

I have the following schema in a SQLite db that is 430 MB on my
Macbook laptop's 320 GB HFS+ formatted 7200 RPM disk with an 8 MB
cache.

-- 1,000,000 rows
CREATE TABLE cell (
  cell_id INTEGER PRIMARY KEY,
  met_cell_id INTEGER,
  8 other INTEGER or REAL columns
)

-- 38 rows
CREATE TABLE lc (
  lc_id INTEGER PRIMARY KEY,
  56 other INTEGER or REAL columns
)

-- 10 rows
CREATE TABLE dist (
  dist_id INTEGER PRIMARY KEY,
  37 other INTEGER or REAL columns
)

-- 2,920,000
CREATE TABLE met (
  met_id INTEGER PRIMARY KEY,
  met_cell_id INTEGER,
  9 other INTEGER or REAL columns
)

CREATE TABLE cell_lc (cell_id INTEGER, lc_id INTEGER)
CREATE TABLE cell_dist (cell_id INTEGER, dist_id INTEGER)

CREATE INDEX idx_met_cell_id ON met (met_cell_id)
CREATE INDEX idx_cell_lc ON cell_lc (cell_id)
CREATE INDEX idx_cell_dist ON cell_dist (cell_id)

I also have an R*Tree index, but that is a different story, not relevant here.

I retrieve *all* data for one cell ':cell_id' using the following queries

[1] First retrieve all data from cell table
SELECT * FROM cell WHERE cell_id = :cell_id

[2] Now retrieve the related lc, dist and met
SELECT lc.*
FROM lc l JOIN cell_lc c on l.lc_id = c.lc_id
WHERE c.cell_id = :cell_id

[3] Retrieve the related dist
SELECT d.*
FROM dist d JOIN cell_dist c on d.dist_id = c.dist_id
WHERE c.cell_id = :cell_id

[4] Retrieve the related met
SELECT * FROM met WHERE met_cell_id = 

I did some benchmarking with the above schema using Perl DBI, and I
get about 30 transactions per second as long as I am returning the
data to memory.
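
(The benchmark harness is roughly the following -- a trimmed sketch,
not the actual carbonmodel.pl; in particular, how the cell_ids are
chosen per iteration is an assumption:)

  use DBI;
  use Benchmark qw(timethis);

  my $dbh = DBI->connect('dbi:SQLite:dbname=carbonmodel.db', '', '',
      { RaiseError => 1 });

  my $q_cell = $dbh->prepare('SELECT * FROM cell WHERE cell_id = ?');
  my $q_lc   = $dbh->prepare(q{
      SELECT l.* FROM lc l JOIN cell_lc c ON l.lc_id = c.lc_id
      WHERE c.cell_id = ?
  });
  my $q_dist = $dbh->prepare(q{
      SELECT d.* FROM dist d JOIN cell_dist c ON d.dist_id = c.dist_id
      WHERE c.cell_id = ?
  });
  my $q_met  = $dbh->prepare('SELECT * FROM met WHERE met_cell_id = ?');

  timethis(1000, sub {
      my $cell_id = int(rand(1_000_000)) + 1;

      $q_cell->execute($cell_id);
      my $cell = $q_cell->fetchrow_hashref;
      $q_cell->finish;

      $q_lc->execute($cell_id);
      my $lc = $q_lc->fetchall_arrayref;

      $q_dist->execute($cell_id);
      my $dist = $q_dist->fetchall_arrayref;

      $q_met->execute($cell->{met_cell_id});
      my $met = $q_met->fetchall_arrayref;
  });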

[08:38 AM] ~/Data/carbonmodel$perl carbonmodel.pl

timethis 1:  0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU) @
33.33/s (n=1)

timethis 10:  0 wallclock secs ( 0.31 usr +  0.02 sys =  0.33 CPU) @
30.30/s (n=10)

timethis 100:  3 wallclock secs ( 2.85 usr +  0.20 sys =  3.05 CPU) @
32.79/s (n=100)

timethis 1000: 33 wallclock secs (31.08 usr +  1.22 sys = 32.30 CPU) @
30.96/s (n=1000)

if I write the data to file, the speed drops to about 1 transaction per second

timethis 1000: 783 wallclock secs (732.26 usr + 18.22 sys = 750.48
CPU) @  1.33/s (n=1000)

Even if I stick with manipulating the data in memory, at 30
transactions per second (or 33 ms per transaction), it would take more
than 9 hours to query each of the 1 million cells one by one.

In the real world, I will first find the relevant cell ids based on
lat-lon bounds (hence my R*Tree index) and then extract their data one
by one.

How can I, if at all, speed this up?

Part 2.
---

Alternatively, I could denormalize the data completely. Inspired by a
post on the Flickr blog
(http://code.flickr.com/blog/2009/03/18/building-fast-client-side-searches/),
in particular the para "To make this data available quickly from the
server, we maintain and update a per-member cache in our database,
where we store each member’s contact list in a text blob — this way
it’s a single quick DB query to retrieve it. We can format this blob
in any way we want: XML, JSON, etc" I decided to experiment with the
same technique. So...

CREATE TABLE cell_blobs (cell_id INTEGER PRIMARY KEY, cell_data BLOB);

I then queried each cell as in Part 1, serialized it and stored it in
the cell_blobs table. My intent is to simply retrieve a BLOB and
deserialize it... it would *possibly* be quicker than 33 ms per
retrieval. Well, I haven't yet completed this test because each BLOB
is taking about 430 KB. At 1 million rows, that is going to occupy
upward of 400 GB. I broke the load_blob_table routine after about a
third of the records had been processed because I found even the
loading_the_blobs to be excruciatingly slow.
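
(The load routine is essentially the sketch below. Storable is just
one possible serializer, and fetch_cell() stands in for running
queries [1]-[4] above -- both are illustrative, not the literal code:)

  use DBI qw(:sql_types);
  use Storable qw(nfreeze);

  my $dbh = DBI->connect('dbi:SQLite:dbname=carbonmodel.db', '', '',
      { RaiseError => 1 });

  my $ins = $dbh->prepare(
      'INSERT INTO cell_blobs (cell_id, cell_data) VALUES (?, ?)');

  for my $cell_id (1 .. 1_000_000) {
      my $data = fetch_cell($dbh, $cell_id);   # hypothetical helper

      $ins->bind_param(1, $cell_id);
      $ins->bind_param(2, nfreeze($data), SQL_BLOB);
      $ins->execute;
      # committing in batches (AutoCommit off, periodic $dbh->commit)
      # would presumably speed this load up considerably
  }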

Suggestions? I am posting the same query on Perlmonks so hopefully I
will have a wealth of suggestions.


-- 
Puneet Kishor http://www.punkish.org/
Nelson Institute for Environmental Studies http://www.nelson.wisc.edu/
Carbon Model http://carbonmodel.org/
Open Source Geospatial Foundation http://www.osgeo.org/
Sent from: Madison WI United States.


Re: [sqlite] writing a text file from sqlite db in C

2009-03-21 Thread Kees Nuyt
On Fri, 20 Mar 2009 08:12:28 -0400 (EDT),
mrobi...@cs.fiu.edu wrote:

>Thanks for your help.
>
>I checked those sites, and I worked with sqlite3.exe using the command
>line tool with no problems; however, I cannot find anywhere samples
>of how to apply these commands in C.
>
>Thru trial and error, I can now use sqlite3_open(), sqlite3_close(), to
>open and close databases as well sqlite3_exec() to create tables, fields,
>indexes, and insert data into the tables. I am stuck trying to do a
>"select" and process the data in the tables.
>
>Please help!!!

Did you have a look at:
http://www.sqlite.org/cvstrac/fileview?f=sqlite/src/shell.c=1.207
?
For example, check the callback() function.

>Michael
>
>
>
>
>> On Thu, 12 Mar 2009 19:06:52 -0400 (EDT),
>> mrobi...@cs.fiu.edu wrote:
>>
>>>Hi everybody,
>>>
>>>I would like to find the exact code in C to
>>>output to a text file, data from a sqlite db in "C"
>>
>> Try the source for the sqlite command line tool.
>>
>>>Also, is there a place to find C sample code for Sqlite?
>>
>> Apart from the sqlite command line tool, I can recommend
>> the source of fossil, which does all its magic around a
>> SQLite database:
>> http://www.fossil-scm.org/index.html/doc/tip/www/index.wiki
>>
>> Also, there is some code in the SQLite wiki:
>> http://www.sqlite.org/cvstrac/wiki?p=SampleCode
>>
>>>Thank you very much
>>>
>>>Michael
>>
>> --
>>   (  Kees Nuyt
>>   )
>> c[_]
>>
>
>
-- 
  (  Kees Nuyt
  )
c[_]


Re: [sqlite] Use sqlite3_update_hook() to watch for changes

2009-03-21 Thread Roger Binns

Nicolas Williams wrote:
> The OS can't know when a SQLite3 transaction has finished,

It doesn't need to.  A second connection will enter its busy handler
until the transaction in the first connection has completed.

Roger