Re: [sqlite] designing a db to hold repeating data

2009-03-16 Thread P Kishor
On Sun, Mar 15, 2009 at 8:57 PM, Rich Shepard  wrote:
> On Sun, 15 Mar 2009, P Kishor wrote:
>
>> CREATE TABLE met (
>>        met_id INTEGER PRIMARY KEY,
>>        other met attributes,
>>        met_grid_id INTEGER
>> );
>
>   Given the quantity of data you have I urge you to read Joe Celko's "SQL
> Programming Style" before you dig yourself into a really deep hole. For
> example, your met_grid covers a defined land area. It would be much better
> to use the geographic coordinates (lat/lon, UTM, State Plane) of the center
> of each grid as the primary key. On the other hand, if all your information
> is associated with the met_grid, then its center should be the unique
> identifier. Your *_id are artificial and convey no useful information.
>
>> Imagine a million square km area divided into 1 sq. km cells. Those are
>> the land cells. Now imagine another overlay, the met grid, on top of the
>> land grid. The met grid is also a million square km, but is tessellated
>> into 2500 sq. km cells, that is, a 20 x 20 grid.
>
>   What you could do is have a Grid table whose primary key is the center of
> the central met_grid, and which references all the other met_grids it
> contains.
>
>   Given your schema above, how can you tell that a specific 1 km**2 met cell
> belongs to a specific 2500 km**2 grid cell? The numbers are meaningless.
> Geographic coordinates use real, meaningful data to associate all 2500 met
> cells in each grid cell.
>
>> I have 7 met attributes for every day of 20 years (that is, 7300 rows)
>> for each "met_cell". So, imagine a stack of 7300 rows sitting on each
>> met cell. In other words, each set of 7300 rows is related to a block
>> of 2500 land cells.
>
>   Then your Met table contains 7 columns (one for each attribute), plus the
> geographic center columns (Northing/Easting, Lat/Lon), and the date. That's
> 10 columns and 7300 rows.
>
>   Now you have a start toward a normalized table.
>
>   I suspect that your thinking has been shaped by trying to use spreadsheets
> as databases, and thinking of SQL as a procedural language rather than a
> declarative language that works on sets of tables (rows, columns) instead of
> files (records, fields).
>
> Rich
>


I appreciate your help, Rich, but your suspicion is unfounded. I don't
use spreadsheets other than when doing my taxes, and while I am only
about 1/100th as good as Igor at SQL, that probably still places me in
the upper 90s, percentile-wise.

That said, my original question was about designing a db for ease of
use and speed of query retrieval, not for size or memory constraints.

I also have a fairly long history with GIS, so I have fully considered
the benefits and drawbacks of using lat/lon as the key. Geographically
unique keys are useless for my problem even though the problem space is
geographic.

I don't really need to know which cell_id or what location I am trying
to query... I will build a map front end that will allow the users to
choose an area of interest and send the bounding box back to the
application. The application will take that bounding box and locate
the cells. Hence, the R*Tree index. The cell_ids will then be used to
return the spatial and weather attributes which will be passed to the
model. Hence, the cell_ids and met_ids can be as artificial as
Sucralose.
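As a sketch of that flow, the bounding box coming back from the map front end can be turned into cell_ids with a query like the one below. A plain table stands in for the R*Tree virtual table here purely so the sketch runs on any build; in the real db, cell_index is the rtree virtual table from the schema and the query shape is the same. Column contents are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Stand-ins for the tables in this thread (cell_index would really be
# the R*Tree virtual table).
cur.execute("CREATE TABLE cells (cell_id INTEGER PRIMARY KEY, met_grid_id INTEGER)")
cur.execute("""CREATE TABLE cell_index
               (cell_id INTEGER, minx REAL, maxx REAL, miny REAL, maxy REAL)""")

cur.executemany("INSERT INTO cells VALUES (?, ?)", [(0, 0), (1, 0)])
cur.executemany("INSERT INTO cell_index VALUES (?, ?, ?, ?, ?)",
                [(0, 0.0, 1.0, 0.0, 1.0),    # inside the query box
                 (1, 5.0, 6.0, 5.0, 6.0)])   # outside it

# Bounding box sent back by the map front end.
minx, maxx, miny, maxy = 0.0, 2.0, 0.0, 2.0
rows = cur.execute("""
    SELECT c.cell_id, c.met_grid_id
      FROM cell_index i
      JOIN cells c ON c.cell_id = i.cell_id
     WHERE i.minx >= ? AND i.maxx <= ?
       AND i.miny >= ? AND i.maxy <= ?""",
    (minx, maxx, miny, maxy)).fetchall()
print(rows)  # only cell 0 falls inside the box
```

The cell_ids returned here are what then get joined against the spatial and weather attributes before being handed to the model.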

In any case, I built the db last night. Weighs in at about 450 MB. I
basically loaded all the weather data in a single table, and linked
the spatial cells to the weather cells using met_id. Built an index on
met_id, and, of course, the R*Tree index. Did a few cursory tests. The
performance is outstanding.

Now I need to do a few load tests, and I think I am well on my way to
the next step.

I think I have converted a data-prep procedure that took several hours
into a process that returns data in a few milliseconds.

Simple schema, simple db, simply great results.

-- 
Puneet Kishor http://www.punkish.org/
Nelson Institute for Environmental Studies http://www.nelson.wisc.edu/
Carbon Model http://carbonmodel.org/
Open Source Geospatial Foundation http://www.osgeo.org/
Sent from: Madison WI United States.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] designing a db to hold repeating data

2009-03-15 Thread Rich Shepard
On Sun, 15 Mar 2009, P Kishor wrote:

> My only concerns are query speed, data integrity (which would, of course,
> be helped by avoiding redundancy), and ease of querying. If I have a 300+ MB
> db, saving 5 or 14 MB doesn't gain me anything.

> I want to do something like this --
>
> "Give me all the land and weather data for cell number ?"
>
> or
>
> "Give me all the land and weather data for all the cells lying within the box 
> ?"

   All the more reason to think carefully about your schema. How would you
determine cell number x? Use a book of look-ups? Memorize the location of
each ID number?

Rich

-- 
Richard B. Shepard, Ph.D.        |  Integrity - Credibility - Innovation
Applied Ecosystem Services, Inc.
Voice: 503-667-4517  Fax: 503-667-8863


Re: [sqlite] designing a db to hold repeating data

2009-03-15 Thread Rich Shepard
On Sun, 15 Mar 2009, P Kishor wrote:

> CREATE TABLE met (
>met_id INTEGER PRIMARY KEY,
>other met attributes,
>met_grid_id INTEGER
> );

   Given the quantity of data you have I urge you to read Joe Celko's "SQL
Programming Style" before you dig yourself into a really deep hole. For
example, your met_grid covers a defined land area. It would be much better
to use the geographic coordinates (lat/lon, UTM, State Plane) of the center
of each grid as the primary key. On the other hand, if all your information
is associated with the met_grid, then its center should be the unique
identifier. Your *_id are artificial and convey no useful information.

> Imagine a million square km area divided into 1 sq. km cells. Those are
> the land cells. Now imagine another overlay, the met grid, on top of the
> land grid. The met grid is also a million square km, but is tessellated
> into 2500 sq. km cells, that is, a 20 x 20 grid.

   What you could do is have a Grid table whose primary key is the center of
the central met_grid, and which references all the other met_grids it
contains.

   Given your schema above, how can you tell that a specific 1 km**2 met cell
belongs to a specific 2500 km**2 grid cell? The numbers are meaningless.
Geographic coordinates use real, meaningful data to associate all 2500 met
cells in each grid cell.

> I have 7 met attributes for every day of 20 years (that is, 7300 rows)
> for each "met_cell". So, imagine a stack of 7300 rows sitting on each
> met cell. In other words, each set of 7300 rows is related to a block
> of 2500 land cells.

   Then your Met table contains 7 columns (one for each attribute), plus the
geographic center columns (Northing/Easting, Lat/Lon), and the date. That's
10 columns and 7300 rows.

   Now you have a start toward a normalized table.

   I suspect that your thinking has been shaped by trying to use spreadsheets
as databases, and thinking of SQL as a procedural language rather than a
declarative language that works on sets of tables (rows, columns) instead of
files (records, fields).

Rich

-- 
Richard B. Shepard, Ph.D.        |  Integrity - Credibility - Innovation
Applied Ecosystem Services, Inc.
Voice: 503-667-4517  Fax: 503-667-8863


Re: [sqlite] designing a db to hold repeating data

2009-03-15 Thread P Kishor
On Sun, Mar 15, 2009 at 8:28 PM, Jim Wilcoxson  wrote:
> You could eliminate met_grid_id from the cells table and replace it
> with an expression cell_id/2500.  This expression will automatically
> truncate, giving you met_grid_id whenever you need it.  This will save
> around 5 MB for a 1M row cell table.
>
> Also, queries in the cells table by met_grid_id, if you do that, can
> be made much faster using the relationship above.  If you query
> directly on met_grid_id, sqlite will have to do a table scan.  But if
> you use the relationship, sqlite can use the index, for example:
>
>   select * from cells where met_grid_id = N  (table scan required)
>
> becomes:
>
>   select *,cell_id/2500 as met_grid_id from cells where cell_id
> between N*2500 and ((N+1)*2500)-1   (can use the cell_id index)
>
> I think by the same method you can eliminate met_grid_id from the met
> table, saving around 14.5MB


Interesting idea, and worth thinking about.

Of course, I can index met_grid_id and achieve the same speed boost, no?

I should have emphasized in my original post --

We are not constrained by db size, memory, or CPU power; we have
plenty of those to spare (we will be running our model on an Xgrid of
10 quad-core Xserves). In fact, each machine is kitted out with 32 GB
of RAM and 3 TB of disk space, so I could load the entire db into
memory and work from there.

My only concerns are query speed, data integrity (which would, of
course, be helped by avoiding redundancy), and ease of querying. If I
have a 300+ MB db, saving 5 or 14 MB doesn't gain me anything.

I want to do something like this --

"Give me all the land and weather data for cell number ?"

or

"Give me all the land and weather data for all the cells lying within the box ?"

Hence the R*Tree.


>
> Jim
>
>
> On 3/15/09, P Kishor  wrote:
>> I have a grid of 1000 x 1000 cells with their own data as well as 20
>> years of daily weather data (20 * 365 = 7300 rows) for each contiguous
>> group of 50 x 50 cells.
>>
>> CREATE TABLE cells (
>>       cell_id INTEGER PRIMARY KEY,
>>       other cell attributes,
>>       lat,
>>       lon,
>>       met_grid_id INTEGER
>> );
>>
>> cell_id met_grid_id
>> --- 
>> 0       0
>> 1       0
>> ..
>> 2499    0
>> 2500    1
>> 2501    1
>> ..
>> 4999    1
>>
>> CREATE TABLE met (
>>       met_id INTEGER PRIMARY KEY,
>>       other met attributes,
>>       met_grid_id INTEGER
>> );
>>
>> met_id met_grid_id
>> -- ---
>> 0      0
>> 1      0
>> ..
>> 7299   0
>> 7300   1
>> 7301   1
>> ..
>>
>> CREATE VIRTUAL TABLE cell_index USING rtree (
>>       cell_id INTEGER,
>>       minx REAL,
>>       maxx REAL,
>>       miny REAL,
>>       maxy REAL
>> )
>>
>> The db is about 350 MB with the cells table with 1,000,000 rows and the
>> met table with 2,920,000 rows and the R*Tree index.
>>
>> Is there any better way that jumps out at any of you?
>>
>>
>> --
>> Puneet Kishor http://www.punkish.org/
>> Nelson Institute for Environmental Studies http://www.nelson.wisc.edu/
>> Carbon Model http://carbonmodel.org/
>> Open Source Geospatial Foundation http://www.osgeo.org/
>> Sent from: Madison WI United States.


Re: [sqlite] designing a db to hold repeating data

2009-03-15 Thread Jim Wilcoxson
You could eliminate met_grid_id from the cells table and replace it
with an expression cell_id/2500.  This expression will automatically
truncate, giving you met_grid_id whenever you need it.  This will save
around 5 MB for a 1M row cell table.

Also, queries in the cells table by met_grid_id, if you do that, can
be made much faster using the relationship above.  If you query
directly on met_grid_id, sqlite will have to do a table scan.  But if
you use the relationship, sqlite can use the index, for example:

   select * from cells where met_grid_id = N  (table scan required)

becomes:

   select *,cell_id/2500 as met_grid_id from cells where cell_id
between N*2500 and ((N+1)*2500)-1   (can use the cell_id index)

I think by the same method you can eliminate met_grid_id from the met
table, saving around 14.5 MB.
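The integer-division relationship behind that rewrite is easy to check in isolation; this sketch just mirrors the SQL arithmetic in Python (the constant 2500 comes from the thread's 50 x 50 blocks of land cells):

```python
# met_grid_id is implied by cell_id, so it need not be stored.
CELLS_PER_GRID = 2500  # one 50 x 50 block of 1 sq. km land cells

def met_grid_id(cell_id):
    # Integer division truncates, exactly like cell_id/2500 in SQL.
    return cell_id // CELLS_PER_GRID

def cell_id_range(n):
    # Inclusive cell_id range covered by met grid n -- this is the
    # BETWEEN range that lets SQLite use the cell_id primary key.
    return n * CELLS_PER_GRID, (n + 1) * CELLS_PER_GRID - 1

print(met_grid_id(2499), met_grid_id(2500))  # 0 1
print(cell_id_range(1))                      # (2500, 4999)
```

The two checks match the sample data in the original post: cell 2499 is the last cell of met grid 0 and cell 2500 is the first cell of met grid 1.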

Jim


On 3/15/09, P Kishor  wrote:
> I have a grid of 1000 x 1000 cells with their own data as well as 20
> years of daily weather data (20 * 365 = 7300 rows) for each contiguous
> group of 50 x 50 cells.
>
> CREATE TABLE cells (
>   cell_id INTEGER PRIMARY KEY,
>   other cell attributes,
>   lat,
>   lon,
>   met_grid_id INTEGER
> );
>
> cell_id met_grid_id
> ------- -----------
> 0       0
> 1       0
> ..
> 2499    0
> 2500    1
> 2501    1
> ..
> 4999    1
>
> CREATE TABLE met (
>   met_id INTEGER PRIMARY KEY,
>   other met attributes,
>   met_grid_id INTEGER
> );
>
> met_id met_grid_id
> -- ---
> 0  0
> 1  0
> ..
> 7299   0
> 7300   1
> 7301   1
> ..
>
> CREATE VIRTUAL TABLE cell_index USING rtree (
>   cell_id INTEGER,
>   minx REAL,
>   maxx REAL,
>   miny REAL,
>   maxy REAL
> )
>
> The db is about 350 MB with the cells table with 1,000,000 rows and the
> met table with 2,920,000 rows and the R*Tree index.
>
> Is there any better way that jumps out at any of you?
>
>
> --
> Puneet Kishor http://www.punkish.org/
> Nelson Institute for Environmental Studies http://www.nelson.wisc.edu/
> Carbon Model http://carbonmodel.org/
> Open Source Geospatial Foundation http://www.osgeo.org/
> Sent from: Madison WI United States.
-- 
Software first.  Software lasts!


Re: [sqlite] designing a db to hold repeating data

2009-03-15 Thread John Machin
On 16/03/2009 11:45 AM, P Kishor wrote:
> On Sun, Mar 15, 2009 at 7:29 PM, John Machin  wrote:
>> On 16/03/2009 11:00 AM, P Kishor wrote:
>>> I have a grid of 1000 x 1000 cells with their own data as well as 20
>>> years of daily weather data (20 * 365 = 7300 rows) for each contiguous
>>> group of 50 x 50 cells.
>>>
>> [snip]
>>> CREATE TABLE met (
>>>met_id INTEGER PRIMARY KEY,
>>>other met attributes,
>>>met_grid_id INTEGER
>>> );
>> [snip]
>>
>>> The db is about 350 MB with the cells table with 1,000,000 rows and the
>>> met table with 2,920,000 rows and the R*Tree index.
>>>
>>> Is there any other better way that jumps out at any of you?
>> How many unique sets of "other met attributes" are there? If the answer is
>> substantially less than 2,920,000, you could consider parking those on yet
>> another table, referenced from the "met" table.
>>
> 
> 
> I am not sure what that gains me?

If there are only (say) 100,000 unique sets of "other met
attributes", then you can save substantial space, at the cost of some
complexity.

Here's a simple example of the concept in practice:

Imagine a spreadsheet containing the names and addresses of the 
employees of a large corporation. The "surname / family name / last 
name" column is quite likely to have many fewer unique entries than the 
number of rows. The same is likely to apply to the "city / location" 
column and the "zipcode / postcode" column [assuming the zipcode is kept 
as text and not a number].

Space is saved by keeping the unique strings in a "shared string table"; 
text cells contain an index into the shared string table.
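A minimal sketch of that normalization applied to the met table (the attribute names tmax/tmin/precip are invented for illustration; the point is only that identical attribute sets are stored once and referenced by id):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Unique attribute sets live once in met_attrs; met rows point at them.
cur.execute("""CREATE TABLE met_attrs (
                   attrs_id INTEGER PRIMARY KEY,
                   tmax REAL, tmin REAL, precip REAL,
                   UNIQUE (tmax, tmin, precip))""")
cur.execute("""CREATE TABLE met (
                   met_id INTEGER PRIMARY KEY,
                   attrs_id INTEGER REFERENCES met_attrs (attrs_id),
                   met_grid_id INTEGER)""")

def insert_met(met_id, met_grid_id, attrs):
    # Reuse the existing met_attrs row if this set was seen before.
    cur.execute("INSERT OR IGNORE INTO met_attrs (tmax, tmin, precip) "
                "VALUES (?, ?, ?)", attrs)
    attrs_id = cur.execute(
        "SELECT attrs_id FROM met_attrs "
        "WHERE tmax = ? AND tmin = ? AND precip = ?", attrs).fetchone()[0]
    cur.execute("INSERT INTO met VALUES (?, ?, ?)",
                (met_id, attrs_id, met_grid_id))

# Two identical days of weather end up sharing one attribute row.
insert_met(0, 0, (10.0, 2.0, 0.0))
insert_met(1, 0, (10.0, 2.0, 0.0))
n_attrs = cur.execute("SELECT count(*) FROM met_attrs").fetchone()[0]
n_met = cur.execute("SELECT count(*) FROM met").fetchone()[0]
print(n_attrs, n_met)  # 1 2
```

How much this saves depends entirely on how many of the 2,920,000 met rows actually repeat, which is why the question of unique attribute sets matters.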

> 
> Imagine a million square km area divided into 1 sq. km cells. Those
> are the land cells.
> 
> Now imagine another overlay, the met grid, on top of the land grid.
> The met grid is also a million square km, but is tessellated into
> 2500 sq. km cells, that is, a 20 x 20 grid.
> 
> I have 7 met attributes for every day of 20 years (that is, 7300 rows)
> for each "met_cell". So, imagine a stack of 7300 rows sitting on each
> met cell. In other words, each set of 7300 rows is related to a block
> of 2500 land cells.
> 
> Hope that helps picture the problem.

It appears to match what I had to infer from your sample blah_blah_id 
numbers :-)

Cheers,
John




Re: [sqlite] designing a db to hold repeating data

2009-03-15 Thread Alex Mandel
You should talk with Alessandro Furieri a.furi...@lqt.it who is
currently working on adding Raster support to spatialite.
http://www.gaia-gis.it/spatialite/

It may be more efficient to store the data separately in well-used
raster formats and attach them to the database as needed for queries.

Alex

P Kishor wrote:
> I have a grid of 1000 x 1000 cells with their own data as well as 20
> years of daily weather data (20 * 365 = 7300 rows) for each contiguous
> group of 50 x 50 cells.
> 
> CREATE TABLE cells (
>   cell_id INTEGER PRIMARY KEY,
>   other cell attributes,
>   lat,
>   lon,
>   met_grid_id INTEGER
> );
> 
> cell_id met_grid_id
> ------- -----------
> 0       0
> 1       0
> ..
> 2499    0
> 2500    1
> 2501    1
> ..
> 4999    1
> 
> CREATE TABLE met (
>   met_id INTEGER PRIMARY KEY,
>   other met attributes,
>   met_grid_id INTEGER
> );
> 
> met_id met_grid_id
> -- ---
> 0  0
> 1  0
> ..
> 7299   0
> 7300   1
> 7301   1
> ..
> 
> CREATE VIRTUAL TABLE cell_index USING rtree (
>   cell_id INTEGER,
>   minx REAL,
>   maxx REAL,
>   miny REAL,
>   maxy REAL
> )
> 
> The db is about 350 MB with the cells table with 1,000,000 rows and the
> met table with 2,920,000 rows and the R*Tree index.
> 
> Is there any better way that jumps out at any of you?
> 
> 



[sqlite] designing a db to hold repeating data

2009-03-15 Thread P Kishor
I have a grid of 1000 x 1000 cells with their own data as well as 20
years of daily weather data (20 * 365 = 7300 rows) for each contiguous
group of 50 x 50 cells.

CREATE TABLE cells (
cell_id INTEGER PRIMARY KEY,
other cell attributes,
lat,
lon,
met_grid_id INTEGER
);

cell_id met_grid_id
------- -----------
0       0
1       0
..
2499    0
2500    1
2501    1
..
4999    1

CREATE TABLE met (
met_id INTEGER PRIMARY KEY,
other met attributes,
met_grid_id INTEGER
);

met_id met_grid_id
-- ---
0  0
1  0
..
7299   0
7300   1
7301   1
..

CREATE VIRTUAL TABLE cell_index USING rtree (
cell_id INTEGER,
minx REAL,
maxx REAL,
miny REAL,
maxy REAL
);

The db is about 350 MB with the cells table with 1,000,000 rows and the
met table with 2,920,000 rows and the R*Tree index.

Is there any better way that jumps out at any of you?
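For anyone who wants to poke at this layout, here is a toy version of the two base tables and the "give me all the land and weather data for cell number ?" query. The "other attributes" are reduced to one placeholder column each, and the column names elev/tmax are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Toy versions of the cells and met tables from the schema above.
cur.execute("""CREATE TABLE cells (
                   cell_id INTEGER PRIMARY KEY,
                   elev REAL, lat REAL, lon REAL,
                   met_grid_id INTEGER)""")
cur.execute("""CREATE TABLE met (
                   met_id INTEGER PRIMARY KEY,
                   tmax REAL,
                   met_grid_id INTEGER)""")
cur.execute("CREATE INDEX met_grid_idx ON met (met_grid_id)")

# Cell 2499 belongs to met grid 0, which has two days of weather.
cur.execute("INSERT INTO cells VALUES (2499, 100.0, 43.07, -89.40, 0)")
cur.executemany("INSERT INTO met VALUES (?, ?, ?)",
                [(0, 12.5, 0), (1, 13.5, 0)])

# "Give me all the land and weather data for cell number ?"
rows = cur.execute("""
    SELECT c.cell_id, c.elev, m.met_id, m.tmax
      FROM cells c
      JOIN met m ON m.met_grid_id = c.met_grid_id
     WHERE c.cell_id = ?""", (2499,)).fetchall()
print(len(rows))  # one row per day of weather for that cell's met grid
```

Each land cell joins to the full stack of weather rows for its met grid, which is exactly the "stack of 7300 rows sitting on each met cell" picture, scaled down to two days.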


-- 
Puneet Kishor http://www.punkish.org/
Nelson Institute for Environmental Studies http://www.nelson.wisc.edu/
Carbon Model http://carbonmodel.org/
Open Source Geospatial Foundation http://www.osgeo.org/
Sent from: Madison WI United States.