[PERFORM] Thousands of tables versus on table?

2007-06-04 Thread Thomas Andrews
I have several thousand clients.  Our clients do surveys, and each survey
has two tables for the client data,

   responders
   responses

Frequent inserts into both table.

Right now, we are seeing significant time during inserts to these two
tables.

Some of the indices in tableA and tableB do not index on the client ID
first.

So, we are considering two possible solutions.

 (1) Create separate responders and responses tables for each client.

 (2) Make sure all indices on responders and responses start with the
  client id (excepting, possibly, the primary keys for these fields) and
  have all normal operation queries always include an id_client.

Right now, for example, given a responder and a survey question, we do a
query in responses by the id_responder and id_survey.  This gives us a
unique record, but I'm wondering if maintaining the index on
(id_responder,id_survey) is more costly on inserts than maintaining the
index (id_client,id_responder,id_survey) given that we also have other
indices on (id_client,...).

Option (1) makes me very nervous.  I don't like the idea of the same sorts
of data being stored in lots of different tables, in part for long-term
maintenance reasons.  We don't really need cross-client reporting, however.

=thomas


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Thousands of tables versus on table?

2007-06-04 Thread Thomas Andrews
Oh, and we vacuum every day.  Not sure about REINDEX, but I doubt we 
have done that.


=thomas

Mark Lewis wrote:

On Mon, 2007-06-04 at 13:40 -0400, Thomas Andrews wrote:

I have several thousand clients.  Our clients do surveys, and each survey
has two tables for the client data,

   responders
   responses

Frequent inserts into both table.

Right now, we are seeing significant time during inserts to these two
tables.


Can you provide some concrete numbers here?  Perhaps an EXPLAIN ANALYZE
for the insert, sizes of tables, stuff like that?


Some of the indices in tableA and tableB do not index on the client ID
first.


What reason do you have to think that this matters?


So, we are considering two possible solutions.

 (1) Create separate responders and responses tables for each client.

 (2) Make sure all indices on responders and responses start with the
  client id (excepting, possibly, the primary keys for these fields) and
  have all normal operation queries always include an id_client.

Right now, for example, given a responder and a survey question, we do a
query in responses by the id_responder and id_survey.  This gives us a
unique record, but I'm wondering if maintaining the index on
(id_responder,id_survey) is more costly on inserts than maintaining the
index (id_client,id_responder,id_survey) given that we also have other
indices on (id_client,...).

Option (1) makes me very nervous.  I don't like the idea of the same sorts
of data being stored in lots of different tables, in part for long-term
maintenance reasons.  We don't really need cross-client reporting, however.


What version of PG is this?  What is your vacuuming strategy?  Have you
tried a REINDEX to see if that helps?

-- Mark Lewis



begin:vcard
fn:Thomas Andrews
n:Andrews;Thomas
org:Soliant Consulting
adr;dom:;; 14 N. Peoria Street 2H;Chicago;IL;60607
email;internet:[EMAIL PROTECTED]
title:Application Engineer
tel;work:617-868-5580
tel;home:617-868-5580
tel;cell:617-599-8728
x-mozilla-html:FALSE
url:http://www.soliantconsulting.com/
version:2.1
end:vcard


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Thousands of tables versus on table?

2007-06-04 Thread Thomas Andrews

We're running 7.4 but will be upgrading to 8.2.

The responses table has 20,000,000 records.

Sometimes (but not all the time) an insert into the responses table can 
take 5-6 seconds.


I guess my real question is, does it ever make sense to create thousands 
of tables like this?


=thomas

Mark Lewis wrote:

On Mon, 2007-06-04 at 13:40 -0400, Thomas Andrews wrote:

I have several thousand clients.  Our clients do surveys, and each survey
has two tables for the client data,

   responders
   responses

Frequent inserts into both table.

Right now, we are seeing significant time during inserts to these two
tables.


Can you provide some concrete numbers here?  Perhaps an EXPLAIN ANALYZE
for the insert, sizes of tables, stuff like that?


Some of the indices in tableA and tableB do not index on the client ID
first.


What reason do you have to think that this matters?


So, we are considering two possible solutions.

 (1) Create separate responders and responses tables for each client.

 (2) Make sure all indices on responders and responses start with the
  client id (excepting, possibly, the primary keys for these fields) and
  have all normal operation queries always include an id_client.

Right now, for example, given a responder and a survey question, we do a
query in responses by the id_responder and id_survey.  This gives us a
unique record, but I'm wondering if maintaining the index on
(id_responder,id_survey) is more costly on inserts than maintaining the
index (id_client,id_responder,id_survey) given that we also have other
indices on (id_client,...).

Option (1) makes me very nervous.  I don't like the idea of the same sorts
of data being stored in lots of different tables, in part for long-term
maintenance reasons.  We don't really need cross-client reporting, however.


What version of PG is this?  What is your vacuuming strategy?  Have you
tried a REINDEX to see if that helps?

-- Mark Lewis



begin:vcard
fn:Thomas Andrews
n:Andrews;Thomas
org:Soliant Consulting
adr;dom:;; 14 N. Peoria Street 2H;Chicago;IL;60607
email;internet:[EMAIL PROTECTED]
title:Application Engineer
tel;work:617-868-5580
tel;home:617-868-5580
tel;cell:617-599-8728
x-mozilla-html:FALSE
url:http://www.soliantconsulting.com/
version:2.1
end:vcard


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Thousands of tables versus on table?

2007-06-04 Thread Thomas Andrews



On 6/4/07 3:43 PM, "Gregory Stark" <[EMAIL PROTECTED]> wrote:

> 
> "Thomas Andrews" <[EMAIL PROTECTED]> writes:
> 
>> I guess my real question is, does it ever make sense to create thousands of
>> tables like this?
> 
> Sometimes. But usually it's not a good idea.
> 
> What you're proposing is basically partitioning, though you may not actually
> need to put all the partitions together for your purposes. Partitioning's main
> benefit is in the management of the data. You can drop and load partitions in
> chunks rather than have to perform large operations on millions of records.
> 
> Postgres doesn't really get any faster by breaking the tables up like that. In
> fact it probably gets slower as it has to look up which of the thousands of
> tables you want to work with.
> 
> How often do you update or delete records and how many do you update or
> delete? Once per day is a very low frequency for vacuuming a busy table, you
> may be suffering from table bloat. But if you never delete or update records
> then that's irrelevant.

It looks like the most inserts that have occurred in a day is about 2000.
The responders table has 1.3 million records, the responses table has 50
million records.  Most of the inserts are in the responses table.

> 
> Does reindexing or clustering the table make a marked difference?
> 

Clustering sounds like it might be a really good solution.  How long does a
cluster command usually take on a table with 50,000,000 records?  Is it
something that can be run daily/weekly?

I'd rather not post the schema because it's not mine - I'm a consultant.  I
can tell you our vacuum every night is taking 2 hours and that disk IO is
the real killer - the CPU rarely gets higher than 20% or so.

=thomas


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Thousands of tables versus on table?

2007-06-04 Thread Thomas Andrews
Linux 2.4.9, if I¹m reading this right.

=thomas


On 6/4/07 4:08 PM, "Y Sidhu" <[EMAIL PROTECTED]> wrote:

> On 6/4/07, Thomas Andrews <[EMAIL PROTECTED]> wrote:
>> 
>> 
>> 
>> On 6/4/07 3:43 PM, "Gregory Stark" <[EMAIL PROTECTED]> wrote:
>> 
>>> >
>>> > "Thomas Andrews" < [EMAIL PROTECTED]
>>> <mailto:[EMAIL PROTECTED]> > writes:
>>> >
>>>> >> I guess my real question is, does it ever make sense to create thousands
of
>>>> >> tables like this?
>>> >
>>> > Sometimes. But usually it's not a good idea.
>>> >
>>> > What you're proposing is basically partitioning, though you may not
>>> actually
>>> > need to put all the partitions together for your purposes. Partitioning's
>>> main
>>> > benefit is in the management of the data. You can drop and load partitions
>>> in 
>>> > chunks rather than have to perform large operations on millions of
>>> records.
>>> >
>>> > Postgres doesn't really get any faster by breaking the tables up like
>>> that. In
>>> > fact it probably gets slower as it has to look up which of the thousands
>>> of 
>>> > tables you want to work with.
>>> >
>>> > How often do you update or delete records and how many do you update or
>>> > delete? Once per day is a very low frequency for vacuuming a busy table,
>>> you
>>> > may be suffering from table bloat. But if you never delete or update
>>> records 
>>> > then that's irrelevant.
>> 
>> It looks like the most inserts that have occurred in a day is about 2000.
>> The responders table has 1.3 million records, the responses table has 50
>> million records.  Most of the inserts are in the responses table.
>> 
>>> >
>>> > Does reindexing or clustering the table make a marked difference?
>>> >
>> 
>> Clustering sounds like it might be a really good solution.  How long does a
>> cluster command usually take on a table with 50,000,000 records?  Is it
>> something that can be run daily/weekly?
>> 
>> I'd rather not post the schema because it's not mine - I'm a consultant.  I
>> can tell you our vacuum every night is taking 2 hours and that disk IO is
>> the real killer - the CPU rarely gets higher than 20% or so.
>> 
>> =thomas
>> 
>> 
>> ---(end of broadcast)---
>> TIP 5: don't forget to increase your free space map settings
> 
> 
> What OS are you running on?
> 




Re: [PERFORM] Thousands of tables versus on table?

2007-06-05 Thread Thomas Andrews
So, partitioning in PSQL 8 is workable, but breaking up the table up into
actual separate tables is not?

Another solution we have proposed is having 'active' and 'completed' tables.
So, rather than thousands, we'd have four tables:

   responders_active
   responders_completed
   responses_active
   responses_completed

That way, the number of responses_active records would not be as huge.  The
problem, as we see it, is that the responders are entering their responses
and it is taking too long.  But if we separate out active and  completed
surveys, then the inserts will likely cost less.  We might even be able to
reduce the indices on the _active tables because survey administrators would
not want to run as many complex reports on the active responses.

There would be an extra cost, when the survey is completed, of copying the
records from the '_active' table to the '_completed' table and then deleting
them, but that operation is something a survey administrator would be
willing to accept as taking a while (as well as something we could put off
to an off hour, although we have lots of international customers so it's not
clear when our off hours are.)

=thomas


On 6/5/07 12:48 PM, "Scott Marlowe" <[EMAIL PROTECTED]> wrote:

> [EMAIL PROTECTED] wrote:
>> On Mon, 4 Jun 2007, Scott Marlowe wrote:
>> 
>>> Gregory Stark wrote:
>>>>  "Thomas Andrews" <[EMAIL PROTECTED]> writes:
>>>> 
>>>> 
>>>>>  I guess my real question is, does it ever make sense to create
>>>> thousands >  of
>>>>>  tables like this?
>>>>> 
>>>>  Sometimes. But usually it's not a good idea.
>>>>  What you're proposing is basically partitioning, though you may not
>>>>  actually
>>>>  need to put all the partitions together for your purposes.
>>>> Partitioning's
>>>>  main
>>>>  benefit is in the management of the data. You can drop and load
>>>> partitions
>>>>  in
>>>>  chunks rather than have to perform large operations on millions of
>>>>  records.
>>>> 
>>>>  Postgres doesn't really get any faster by breaking the tables up like
>>>>  that. In
>>>>  fact it probably gets slower as it has to look up which of the
>>>> thousands
>>>>  of
>>>>  tables you want to work with.
>>>> 
>>> 
>>> That's not entirely true.  PostgreSQL can be markedly faster using
>>> partitioning as long as you always access it by referencing the
>>> partitioning key in the where clause.  So, if you partition the table
>>> by date, and always reference it with a date in the where clause, it
>>> will usually be noticeably faster.  OTOH, if you access it without
>>> using a where clause that lets it pick partitions, then it will be
>>> slower than one big table.
>>> 
>>> So, while this poster might originally think to have one table for
>>> each user, resulting in thousands of tables, maybe a compromise where
>>> you partition on userid ranges would work out well, and keep each
>>> partition table down to some 50-100 thousand rows, with smaller
>>> indexes to match.
>>> 
>> 
>> what if he doesn't use the postgres internal partitioning, but instead
>> makes his code access the tables named responsesN where N is
>> the id of the customer?
>> 
>> this is what it sounded like he was asking initially.
> 
> Sorry, I think I initially read your response as "Postgres doesn't
> really get any faster by breaking the tables up" without the "like that"
> part.
> 
> I've found that as long as the number of tables is > 10,000 or so,
> having a lot of tables doesn't seem to really slow pgsql down a lot.
> I'm sure that the tipping point is dependent on your db machine.  I
> would bet that if he's referring to individual tables directly, and each
> one has hundreds instead of millions of rows, the performance would be
> better.  But the only way to be sure is to test it.


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate