Re: [GENERAL] How Big is Too Big for Tables?

2010-07-28 Thread P Kishor
On Wed, Jul 28, 2010 at 12:03 PM, Joshua D. Drake j...@commandprompt.com 
wrote:
 On Wed, 2010-07-28 at 11:09 -0600, Bill Thoen wrote:
 I'm building a national database of agricultural information and one
 of the layers is a bit more than a gigabyte per state. That's 1-2
 million records per state, with a multipolygon geometry, and I've got
 about 40 states' worth of data. I'm trying to store everything in a
 single PG table. What I'm concerned about is, if I combine every state
 into one big table, will performance be terrible, even with
 indexes? On the other hand, if I store the data in several smaller
 files, then if a user zooms in on a multi-state region, I've got to
 build or find a much more complicated way to query multiple files.

 So I'm wondering, should I be concerned with building a single
 national-size table (possibly 80-100 GB) for all these records, or
 should I keep the files smaller and hope there's something like
 ogrtindex out there for PG tables? What do you all recommend in this
 case?

 80-100 GB isn't that much. However, it may be worth looking into
 partitioning by state.
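
(For anyone who goes that route, by-state partitioning with the 8.x
inheritance scheme looks roughly like the sketch below. The table and
column names are only illustrative, not Bill's actual schema, and the
geometry column assumes PostGIS is installed.)

CREATE TABLE ag_layer (
    gid         serial,
    state_code  char(2) NOT NULL,
    geom        geometry        -- PostGIS multipolygon
);

CREATE TABLE ag_layer_co (
    CHECK ( state_code = 'CO' )
) INHERITS (ag_layer);

CREATE TABLE ag_layer_ne (
    CHECK ( state_code = 'NE' )
) INHERITS (ag_layer);

-- with constraint exclusion on, a multi-state query against the parent
-- only scans the children whose CHECK constraints can match
SET constraint_exclusion = on;
SELECT count(*) FROM ag_layer WHERE state_code IN ('CO', 'NE');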


See http://archives.postgresql.org/pgsql-general/2010-07/msg00691.php
for details, but here is a summary.

My experience has not been the greatest. I have been trying to figure
out if I can store a few hundred million rows, and have experienced a
great number of problems.

One. Loading the data is a problem. COPY is the quickest way (I was
able to achieve a max of about 20,000 inserts per second). However,
you need to make sure there are no indexes, not even a primary key, in
order to extract maximum speed. That means you have to load
*everything* in one go. If you load in stages, you have to drop all
the indexes, then load, then rebuild the indexes. The next time you
want to load more data, you have to repeat this process. Building the
indexes takes a long time, so experimenting is a chore.
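
For reference, each load cycle for me looks roughly like the sketch
below (it uses the d table from my "optimizing daily data storage"
post; the index names, the primary key columns, and the file path are
placeholders, not my actual setup):

-- drop everything that would slow the load down
ALTER TABLE d DROP CONSTRAINT d_pkey;    -- if a primary key exists
DROP INDEX d_yr_yday_idx;
DROP INDEX d_cell_id_idx;

-- bulk load; COPY is far faster than row-by-row INSERTs
COPY d (yr, yday, a, b, d, e, f, g, cell_id)
    FROM '/path/to/one_batch.csv' WITH CSV;

-- rebuild afterwards; this is the slow part
ALTER TABLE d ADD PRIMARY KEY (cell_id, yr, yday);
CREATE INDEX d_yr_yday_idx ON d (yr, yday);
CREATE INDEX d_cell_id_idx ON d (cell_id);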

Two. Partitioning is not the perfect solution. My database will
ultimately have about 13 million rows per day (it is daily data) for
about 25 years. So, I need one of the following:

- One big table with 25 * 365 * 13 million rows (~118 billion).
Completely undoable.
- 25 yearly tables with 365 * 13 million rows (~4.75 billion) each.
Still a huge chore, with very slow queries.
- 25 * 365 tables with 13 million rows each. More doable, but querying
across that many partitions breaks down (Pg tries to lock every child
table and runs out of memory).

Three. At least in my case, the overhead is too much. My data are
single bytes, but the smallest data type in Pg is smallint (2 bytes).
That, plus the per-row overhead, adds up to a fair amount of overhead.

I haven't yet given up on storing this specific dataset in Pg, but I
am reconsidering. It is all read-only data, so flat files might be
better for me.

In other words, Pg is great, but test, benchmark, and research before
committing to a strategy. Of course, since you are storing geometries,
Pg is a natural choice for you. My data are not geometries, so I can
explore alternatives for that dataset while keeping my geographic data
in Pg.

Hope this helps, or, at least provides an alternative view point.


 Sincerely,

 Joshua D. Drake

 --
 PostgreSQL.org Major Contributor
 Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
 Consulting, Training, Support, Custom Development, Engineering
 http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt






-- 
Puneet Kishor http://www.punkish.org
Carbon Model http://carbonmodel.org
Charter Member, Open Source Geospatial Foundation http://www.osgeo.org
Science Commons Fellow, http://sciencecommons.org/about/whoweare/kishor
Nelson Institute, UW-Madison http://www.nelson.wisc.edu
---
Assertions are politics; backing up assertions with evidence is science
===



Re: [GENERAL] How Big is Too Big for Tables?

2010-07-28 Thread P Kishor
On Wed, Jul 28, 2010 at 1:38 PM, Stephen Frost sfr...@snowman.net wrote:
 * P Kishor (punk.k...@gmail.com) wrote:
 Three. At least in my case, the overhead is too much. My data are
 single bytes, but the smallest data type in Pg is smallint (2 bytes).
 That, plus the per-row overhead, adds up to a fair amount of overhead.

 My first reaction to this would be- have you considered aggregating the
 data before putting it into the database in such a way that you put more
 than 1 byte of data on each row..?  That could possibly reduce the
 number of rows you have by quite a bit and also reduce the impact of the
 per-tuple overhead in PG..


Each row is half a dozen single-byte values, so it is actually 6 bytes
of data per row (six columns). Even if I combine them somehow, the
per-row overhead (which, I believe, is about 23 bytes) is still more
than the data. But that is not the real issue. First, I can't really
merge several days into one row. While it might make for fewer rows,
it would make my data extraction and analysis much more complicated.
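
For what it's worth, the footprint is easy to check directly. A rough
sketch, using the d table from my daily-data post (the numbers it
returns will vary with build and alignment padding):

-- approximate size of one stored row (tuple header + data)
SELECT pg_column_size(d.*) AS bytes_per_tuple FROM d LIMIT 1;

-- average on-disk bytes per row, including page-level overhead
SELECT pg_relation_size('d') / count(*) AS avg_bytes_per_row FROM d;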

The real issue is that once I put 100 million rows in the table, the
queries became far too slow. Of course, I could (and should) upgrade
my hardware (I am using a dual Xeon 3 GHz server with 12 GB RAM), but
there are limits to that route.

Keep in mind, those circa 100 million rows were only part of the db.
If I were to build the entire db, partitioned by year, I would have
about 4.75 billion rows per year. And partitioning by day resulted in
too many tables.

I wish there were a way around all this so I could use Pg, with my
available resources, but it looks bleak right now.



        Thanks,

                Stephen






-- 
Puneet Kishor http://www.punkish.org
Carbon Model http://carbonmodel.org
Charter Member, Open Source Geospatial Foundation http://www.osgeo.org
Science Commons Fellow, http://sciencecommons.org/about/whoweare/kishor
Nelson Institute, UW-Madison http://www.nelson.wisc.edu
---
Assertions are politics; backing up assertions with evidence is science
===



[GENERAL] optimizing daily data storage in Pg

2010-07-22 Thread P Kishor
I have been struggling with this for a while now, have even gone down
a few paths but struck out, so I turn now to the community for ideas.
First, the problem: Store six daily variables for ~ 25 years for cells
in a grid.

 * Number of vars = 6
 * Number of cells ~ 13 million
 * Number of days ~ 9125 (25 * 365)

Optimize the store for two different kinds of queries:

Query one: Retrieve the value of a single var for all or a portion of
the cells for a single day. This is analogous to an image where every
pixel is the value of a single var.

Query two: Retrieve values for all the days or a duration of days for
a single var for a single cell. This is like grabbing a column out of
a table in which each row holds all the vars for a single day.

So, I set about designing the db. The grid is in a table with 13 million rows

CREATE TABLE cells (
    cell_id     INTEGER,
    other_data  ..
)
WITH (
    OIDS=FALSE
);


A single table *where every row is one day's values for one cell* looks like so

CREATE TABLE d (
    yr      SMALLINT,
    yday    SMALLINT,
    a       SMALLINT,
    b       SMALLINT,
    d       SMALLINT,
    e       SMALLINT,
    f       SMALLINT,
    g       SMALLINT,
    cell_id INTEGER
)
WITH (
    OIDS=FALSE
);

The data would look like so

yr      yday    a   b   d   e   f   g   cell_id
-----------------------------------------------
1980       1    x   x   x   x   x   x   1
..
1980     365    x   x   x   x   x   x   1
...
1981       1    x   x   x   x   x   x   1
..
1981     365    x   x   x   x   x   x   1
...
2005       1    x   x   x   x   x   x   1
..
2005     365    x   x   x   x   x   x   1
..
1980       1    x   x   x   x   x   x   2
..
1980     365    x   x   x   x   x   x   2
...

I could now (theoretically) conduct my queries like so:

Query 1a: Retrieve the value of a single var for all the cells for a
single day. This is analogous to an image where every pixel is the
value of a single var.

SELECT var FROM d WHERE yr = ? AND yday = ?;

I'm assuming I would need an index on yr and yday, or perhaps even a
compound index on (yr, yday).

Query 1b: Retrieve the value of a single var for a portion of the
cells for a single day. This is analogous to an image where every
pixel is the value of a single var.

SELECT var FROM d WHERE yr = ? AND yday = ? AND cell_id IN (?,?,?...);

I'm assuming I would need an index on yr and yday, or perhaps even a
compound index on (yr, yday), AND an index on cell_id.

Query 2: Retrieve values for all the days or a duration of days for a
single var for a single cell. This is like grabbing a column out of a
table in which each row holds all the vars for a single day.

SELECT var FROM d WHERE cell_id = ?;
SELECT var FROM d WHERE cell_id IN (?,?,?...);

Once again, an index on cell_id would assist in the above.
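
Concretely, the indexes I have in mind are roughly these (the names
are mine):

-- compound index for queries 1a and 1b (one day, all or some cells)
CREATE INDEX d_yr_yday_idx ON d (yr, yday);

-- index for query 2 (all days, or a range of days, for one cell)
CREATE INDEX d_cell_id_idx ON d (cell_id);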

The problem: the above table would have 13 M * 9125 rows ~ 118 billion
rows. Huge indexes, slow queries, etc. In fact, there were major
issues just loading the data in the first place. Since I am loading
data in batches, I drop the indexes (takes time), COPY data into the
table (takes time), rebuild the indexes (takes time), experiment with
improving the performance (takes time), fail, rinse, lather, repeat. I
actually tried this with a subset of the data (around 100 M rows) and
ran into all of the above. I don't remember the query times, but they
were awful.

So, I partitioned the table into years like so

CREATE TABLE d_yyyy (
    CHECK ( yr = yyyy )
) INHERITS (d);
-- one child table per year; yyyy is a placeholder for the year

Hmmm... still no satisfaction. I ended up with 1 master table + 25
inherited tables. Each of the year tables now had ~ 4.75 billion rows
(13 M * 365), and the queries were still very slow.

So, I partitioned it all by years and days like so

CREATE TABLE d_yyyy_ddd (
    CHECK ( yr = yyyy AND yday = ddd )
) INHERITS (d);
-- yyyy and ddd are placeholders for the year and day, e.g. d_1980_1

Each table now has 13 million rows, and is reasonably fast (although
still not satisfactorily fast), but now I have 9K tables. That has its
own problems. I can't query the master table anymore as Pg tries to
lock all the tables and runs out of memory. Additionally, I can no
longer run query two above. I could do something like

SELECT a FROM d_1980_1 WHERE cell_id = 1
UNION
SELECT a FROM d_1980_2 WHERE cell_id = 1
UNION
SELECT a FROM d_1980_3 WHERE cell_id = 1
UNION
SELECT a FROM d_1980_4 WHERE cell_id = 1
UNION
...

But the above is hardly optimal.
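
A side note on the sketch above: UNION de-duplicates the result, which
forces a sort and would silently collapse repeated values of a, so
UNION ALL is really what I would have to write. That avoids both
problems but still means one arm per day table:

SELECT a FROM d_1980_1 WHERE cell_id = 1
UNION ALL
SELECT a FROM d_1980_2 WHERE cell_id = 1
UNION ALL
SELECT a FROM d_1980_3 WHERE cell_id = 1;
-- ... and so on, one arm for each of the 9,125 day tables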

Any suggestions, ideas, or brainstorms would be appreciated. Perhaps
Pg, or even an RDBMS, is not the right tool for this problem, in which
case suggestions for alternatives would be welcome as well.

Right now I am testing this on a dual Xeon dual core 3 GHz Xserve with
12 GB RAM. The PGDATA directory is located on an attached RAID that is
configured as 

Re: [GENERAL] optimizing daily data storage in Pg

2010-07-22 Thread P Kishor
On Thu, Jul 22, 2010 at 4:56 PM, Andy Colson a...@squeakycode.net wrote:
 On 7/22/2010 9:41 AM, P Kishor wrote:

 I have been struggling with this for a while now, have even gone down
 a few paths but struck out, so I turn now to the community for ideas.
 First, the problem: Store six daily variables for ~ 25 years for cells
 in a grid.

  * Number of vars = 6
  * Number of cells ~ 13 million
  * Number of days ~ 9125 (25 * 365)

 Optimize the store for two different kinds of queries:

 Query one: Retrieve the value of a single var for all or a portion of
 the cells for a single day. This is analogous to an image where every
 pixel is the value of a single var.

      SELECT var FROM d WHERE yr = ? AND yday = ?;
      SELECT var FROM d WHERE yr = ? AND yday = ? AND cell_id IN
 (?,?,?...);



 Query two: Retrieve values for all the days or a duration of days for
 a single var for a single cell. This is like grabbing a column out of
 a table in which each row holds all the vars for a single day.

      SELECT var FROM d WHERE cell_id = ?;
      SELECT var FROM d WHERE cell_id IN (?,?,?...);



 First, I must admit to not reading your entire email.

I am not sure how to respond to your feedback given that you haven't
read the entire email. Nevertheless, thanks for writing...


 Second, Query 1 should be fast, regardless of how you lay out the tables.

It is not fast. Right now I have data for about 250,000 cells loaded.
That comes to circa 92 million rows per year. Performance is pretty
sucky.



 Third, Query 2 will return 13M rows?  I don't think it matters how you lay out
 the tables; returning 13M rows is always going to be slow.


Yes, I understand that. In reality I will never get 13 M rows. For
display purposes, I will probably get around 10,000 rows to 50,000
rows. When more rows are needed, it will be to feed a model, so that
can be offline (without an impatient human being waiting on the other
end).

Right now, my main problem is that I have either too many rows (~4 B
rows) in a manageable number of tables (25 tables) or a manageable
number of rows (~13 M rows) in too many tables (~9,000 tables).



 -Andy




-- 
Puneet Kishor http://www.punkish.org
Carbon Model http://carbonmodel.org
Charter Member, Open Source Geospatial Foundation http://www.osgeo.org
Science Commons Fellow, http://sciencecommons.org/about/whoweare/kishor
Nelson Institute, UW-Madison http://www.nelson.wisc.edu
---
Assertions are politics; backing up assertions with evidence is science
===
