Re: [PERFORM] best db schema for time series data?

2010-11-19 Thread Louis-David Mitterrand
On Tue, Nov 16, 2010 at 11:35:24AM -0500, Chris Browne wrote:
 vindex+lists-pgsql-performa...@apartia.org (Louis-David Mitterrand)
 writes:
  I have to collect lots of prices from web sites and keep track of their
  changes. What is the best option?
 
  1) one 'price' row per price change:
 
  create table price (
  id_price primary key,
  id_product integer references product,
  price integer
  );
 
  2) a single 'price' row containing all the changes:
 
  create table price (
  id_price primary key,
  id_product integer references product,
  price integer[] -- prices are 'pushed' on this array as they change
  );
 
  Which is bound to give the best performance, knowing I will often need
  to access the latest and next-to-latest prices?
 
 I'd definitely bias towards #1, but with a bit of a change...
 
 create table product (
   id_product serial primary key
 );
 
 create table price (
id_product integer references product,
as_at timestamptz default now(),
primary key (id_product, as_at),
price integer
 );

Hi Chris,

So an id_price serial on the price table is not necessary in your
opinion? I am using order by id_price limit X or max(id_price) to
get at the most recent prices.

 The query to get the last 5 prices for a product should be
 splendidly efficient:
 
select price, as_at from price
 where id_product = 17
 order by as_at desc limit 5;
 
 (That'll use the PK index perfectly nicely.)
 
 If you needed higher performance, for latest price, then I'd add a
 secondary table, and use triggers to copy latest price into place:
 
   create table latest_prices (
  id_product integer primary key references product,
  price integer
   );

I did the same thing with a 'price_dispatch' trigger and partitioned
tables (inheritance). It's definitely needed when the price database
grows into the millions.
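
A minimal sketch of the kind of copy trigger Chris describes, assuming
the price and latest_prices tables above (the function and trigger names
here are illustrative, not from the original posts):

  create or replace function copy_latest_price() returns trigger as $$
  begin
      -- keep exactly one row per product in latest_prices
      update latest_prices set price = new.price
       where id_product = new.id_product;
      if not found then
          insert into latest_prices (id_product, price)
          values (new.id_product, new.price);
      end if;
      return new;
  end;
  $$ language plpgsql;

  create trigger price_copy_latest
      after insert on price
      for each row execute procedure copy_latest_price();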

Thanks,

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] best db schema for time series data?

2010-11-19 Thread Louis-David Mitterrand
On Tue, Nov 16, 2010 at 05:28:19PM +0100, Harald Fuchs wrote:
 In article 4ce2688b.2050...@tweakers.net,
 Arjen van der Meijden acmmail...@tweakers.net writes:
 
  On 16-11-2010 11:50, Louis-David Mitterrand wrote:
  I have to collect lots of prices from web sites and keep track of their
  changes. What is the best option?
  
  1) one 'price' row per price change:
  
  create table price (
  id_price primary key,
  id_product integer references product,
  price integer
  );
  
  2) a single 'price' row containing all the changes:
  
  create table price (
  id_price primary key,
  id_product integer references product,
  price integer[] -- prices are 'pushed' on this array as they change
  );
  
  Which is bound to give the best performance, knowing I will often need
  to access the latest and next-to-latest prices?
 
  If you mostly need the last few prices, I'd definitely go with the
  first approach; it's much cleaner. Besides, you can store a date/time
  per price, so you know when it changed. With the array approach that's
  a bit harder to do.
 
 I'd probably use a variant of this:
 
   CREATE TABLE prices (
 pid int NOT NULL REFERENCES products,
 validTil timestamp(0) NULL,
 price int NOT NULL,
 UNIQUE (pid, validTil)
   );
 
 The current price of a product is always the row with validTil IS NULL.
 The lookup should be pretty fast because it can use the index of the
 UNIQUE constraint.
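
A sketch of the lookups against that variant (the product id 17 is just
an example value): the current price is the open-ended row, and the
next-to-latest price is the most recently closed one:

  SELECT price FROM prices
   WHERE pid = 17 AND validTil IS NULL;

  SELECT price FROM prices
   WHERE pid = 17 AND validTil IS NOT NULL
   ORDER BY validTil DESC LIMIT 1;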

Hi,

The validTil idea is nice, but you have to manage that field with a
trigger, right?

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Defaulting wal_sync_method to fdatasync on Linux for 9.1?

2010-11-19 Thread Jignesh Shah
On Tue, Nov 16, 2010 at 8:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Josh Berkus j...@agliodbs.com writes:
 Well, we're not going to increase the default to gigabytes, but we could
 very probably increase it by a factor of 10 or so without anyone
 squawking.  It's been awhile since I heard of anyone trying to run PG in
 4MB shmmax.  How much would a change of that size help?

 Last I checked, though, this comes out of the allocation available to
 shared_buffers.  And there definitely are several OSes (several linuxes,
 OSX) still limited to 32MB by default.

 Sure, but the current default is a measly 64kB.  We could increase that
 10x for a relatively small percentage hit in the size of shared_buffers,
 if you suppose that there's 32MB available.  The current default is set
 to still work if you've got only a couple of MB in SHMMAX.

 What we'd want is for initdb to adjust the setting as part of its
 probing to see what SHMMAX is set to.

                        regards, tom lane



In all the performance tests that I have done, I generally get good bang
for the buck with wal_buffers set to 512kB in low-memory cases, and
mostly I set it to 1MB, which is probably enough for most cases even
with high memory.

That 1/2 MB won't make a drastic change to shared_buffers anyway (except
for edge cases), but it will relieve the stress on the WAL buffers quite
a bit.
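
For reference, a postgresql.conf sketch of the setting being discussed
(the values are the ones suggested above, not official defaults, and
wal_buffers requires a server restart to take effect):

  # postgresql.conf
  wal_buffers = 1MB        # 512kB for low-memory configurations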

Regards,
Jignesh

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] best db schema for time series data?

2010-11-19 Thread Chris Browne
vindex+lists-pgsql-performa...@apartia.org (Louis-David Mitterrand)
writes:
 On Tue, Nov 16, 2010 at 11:35:24AM -0500, Chris Browne wrote:
 vindex+lists-pgsql-performa...@apartia.org (Louis-David Mitterrand)
 writes:
  I have to collect lots of prices from web sites and keep track of their
  changes. What is the best option?
 
  1) one 'price' row per price change:
 
 create table price (
 id_price primary key,
 id_product integer references product,
 price integer
 );
 
  2) a single 'price' row containing all the changes:
 
 create table price (
 id_price primary key,
 id_product integer references product,
 price integer[] -- prices are 'pushed' on this array as they change
 );
 
  Which is bound to give the best performance, knowing I will often need
  to access the latest and next-to-latest prices?
 
 I'd definitely bias towards #1, but with a bit of a change...
 
 create table product (
   id_product serial primary key
 );
 
 create table price (
id_product integer references product,
as_at timestamptz default now(),
primary key (id_product, as_at),
price integer
 );

 Hi Chris,

 So an id_price serial on the price table is not necessary in your
 opinion? I am using order by id_price limit X or max(id_price) to
 get at the most recent prices.

It (id_price) is an extra piece of information that doesn't reveal an
important fact, namely when the price was added.

I'm uncomfortable with adding data that doesn't provide much more
information, and it troubles me when people put a lot of interpretation
into the meanings of SERIAL columns.

I'd like to set up some schemas (for experiment, if not necessarily to
get deployed to production) where I'd use DCE UUID values rather than
sequences, so that people wouldn't make the error of imagining meanings
in the values that aren't really there.  

And I suppose that therein lies a way to think about it...  If you used
UUIDs rather than SERIAL, how would your application break?

And of the ways in which it would break, which of those are errors that
stem from:

 a) Ignorant usage, assuming order that isn't really there?  (e.g. - a
SERIAL might capture some order information, but UUID won't!)

 b) Inadequate data capture, where you're using the implicit data
collection from SERIAL to capture, poorly, information that should
be expressly captured?

When I added the timestamp to the price table, that was intended to
address b): expressly capturing the time at which the price was added.
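
A sketch of what that experiment might look like (this assumes the
uuid-ossp contrib module for uuid_generate_v4(); on releases that have
CREATE EXTENSION it can be loaded as shown, older releases would run the
contrib SQL script instead):

  create extension "uuid-ossp";

  create table product (
      id_product uuid primary key default uuid_generate_v4()
  );

  create table price (
      id_product uuid references product,
      as_at timestamptz default now(),
      primary key (id_product, as_at),
      price integer
  );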

 The query to get the last 5 prices for a product should be
 splendidly efficient:
 
select price, as_at from price
 where id_product = 17
 order by as_at desc limit 5;
 
 (That'll use the PK index perfectly nicely.)
 
 If you needed higher performance, for latest price, then I'd add a
 secondary table, and use triggers to copy latest price into place:
 
   create table latest_prices (
  id_product integer primary key references product,
  price integer
   );

 I did the same thing with a 'price_dispatch' trigger and partitioned
 tables (inheritance). It's definitely needed when the price database
 grows into the millions.

 Thanks,

The conversations are always interesting!  Cheers!
-- 
output = (cbbrowne @ gmail.com)
http://www3.sympatico.ca/cbbrowne/x.html
FLORIDA: If you think we can't vote, wait till you see us drive.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] best db schema for time series data?

2010-11-19 Thread Robert Klemme
On Fri, Nov 19, 2010 at 10:50 AM, Louis-David Mitterrand
vindex+lists-pgsql-performa...@apartia.org wrote:
 On Tue, Nov 16, 2010 at 05:28:19PM +0100, Harald Fuchs wrote:
 In article 4ce2688b.2050...@tweakers.net,
 Arjen van der Meijden acmmail...@tweakers.net writes:

  On 16-11-2010 11:50, Louis-David Mitterrand wrote:
  I have to collect lots of prices from web sites and keep track of their
  changes. What is the best option?
 
  1) one 'price' row per price change:
 
  create table price (
  id_price primary key,
  id_product integer references product,
  price integer
  );
 
  2) a single 'price' row containing all the changes:
 
  create table price (
  id_price primary key,
  id_product integer references product,
  price integer[] -- prices are 'pushed' on this array as they change
  );
 
  Which is bound to give the best performance, knowing I will often need
  to access the latest and next-to-latest prices?

  If you mostly need the last few prices, I'd definitely go with the
  first approach; it's much cleaner. Besides, you can store a date/time
  per price, so you know when it changed. With the array approach that's
  a bit harder to do.

 I'd probably use a variant of this:

   CREATE TABLE prices (
     pid int NOT NULL REFERENCES products,
     validTil timestamp(0) NULL,
     price int NOT NULL,
     UNIQUE (pid, validTil)
   );

 The current price of a product is always the row with validTil IS NULL.
 The lookup should be pretty fast because it can use the index of the
 UNIQUE constraint.

Even better: with a partial index, the lookup should be more efficient,
and it will probably stay that way even as the number of prices
increases (while the number of products stays the same).  With

CREATE UNIQUE INDEX current_prices
ON prices (
  pid
)
WHERE validTil IS NULL;

I get

robert=# explain select price from prices where pid = 12344 and
validTil is null;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using current_prices on prices  (cost=0.00..8.28 rows=1 width=4)
   Index Cond: (pid = 12344)
(2 rows)

The index can actually be used here.

(see attachment)

 The validTil idea is nice, but you have to manage that field with a
 trigger, right?

Well, you don't need to.  You can always do

begin;
update prices set validTil = current_timestamp
  where pid = 123 and validTil is NULL;
insert into prices values ( 123, null, 94 );
commit;

But with a trigger it would be more convenient of course.
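
A sketch of such a trigger (names are illustrative; it assumes the
prices table above and closes the open row whenever a new current price
is inserted):

  create or replace function close_current_price() returns trigger as $$
  begin
      if new.validTil is null then
          -- close out the previous current price for this product
          update prices set validTil = current_timestamp
           where pid = new.pid and validTil is null;
      end if;
      return new;
  end;
  $$ language plpgsql;

  create trigger prices_close_current
      before insert on prices
      for each row execute procedure close_current_price();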

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
\timing on

DROP TABLE prices;

CREATE TABLE prices (
  pid int NOT NULL, -- REFERENCES products,
  validTil timestamp(0) NULL,
  price int NOT NULL,
  UNIQUE (pid, validTil)
);

CREATE UNIQUE INDEX current_prices
ON prices (
  pid
)
WHERE validTil IS NULL;

INSERT INTO prices
SELECT generate_series,
CASE MOD(generate_series,10)
WHEN 0 THEN NULL
ELSE current_timestamp - MOD(generate_series,10) * interval '1' day
END,
123
FROM generate_series(1,100);

explain select price from prices where pid = 12344 and validTil is null;


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] autovacuum blocks the operations of other manual vacuum

2010-11-19 Thread Alvaro Herrera
Excerpts from kuopo's message of Thu Nov 18 04:10:36 -0300 2010:
 Hi,
 
 Thanks for your response. I've checked it again and found that the
 main cause is the execution of ANALYZE. As I mentioned, I have two
 tables: table A is a big one (around 10M~100M records) for log data,
 and table B is a small one (around 1k records) for keeping some
 current status. There are a lot of update operations and some search
 operations on table B. For performance reasons, I would like to keep
 table B as compact as possible. Following your suggestion, I try to
 invoke a standard vacuum (not full) more frequently (e.g., once per
 minute).
 
 However, while I am analyzing table A, the autovacuum or vacuum on
 table B cannot find any removable row versions (the number of
 nonremovable row versions and pages keeps increasing). After the
 analysis finishes, the search operations on table B are still
 inefficient. If I run a full vacuum right away, I get quick response
 times for the search operations on table B again.

Hmm, I don't think we can optimize the analyze-only operation the same
way we optimize vacuum (i.e. allow vacuum to proceed while it's in
progress).  Normally analyze shouldn't take all that long anyway -- why
is it that slow?  Are you calling it in a transaction that also does
other stuff?  Are you analyzing more than one table in a single
transaction, perhaps even the whole database?

Perhaps you could speed it up by lowering vacuum_cost_delay, if it's set
to a nonzero value.
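
For example (a sketch only; the table name stands in for "table A", and
zero simply disables the cost-based delay for that session):

  SHOW vacuum_cost_delay;
  SET vacuum_cost_delay = 0;
  ANALYZE big_log_table;    -- hypothetical name for the large log table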

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] autovacuum blocks the operations of other manual vacuum

2010-11-19 Thread tv
 Excerpts from kuopo's message of Thu Nov 18 04:10:36 -0300 2010:
 However, while I am analyzing table A, the autovacuum or vacuum on
 table B cannot find any removable row versions (the number of
 nonremovable row versions and pages keeps increasing). After the
 analysis finishes, the search operations on table B are still
 inefficient. If I run a full vacuum right away, I get quick response
 times for the search operations on table B again.

Hi, I don't know how to fix the long VACUUM/ANALYZE, but have you tried to
minimize the growth using HOT?

HOT means that if you update only columns that are not indexed, and if
the new row version fits into the same page (as part of an update
chain), the update does not leave behind dead rows that need a regular
vacuum to clean up.

Are there any indexes on the small table? How large is it? You've
mentioned there are about 2049 rows - that might be just a few pages, so
the indexes would not be very efficient anyway.

Try to remove the indexes, and maybe create the table with a smaller
fillfactor (so that there is more space for the updates).

That should be much more efficient and the table should not grow.

You can see if HOT works through pg_stat_all_tables view (columns
n_tup_upd and n_tup_hot_upd).
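
A sketch of both suggestions (table and column names are illustrative;
"status_b" stands in for the small table B):

  -- recreate the small table with a lower fillfactor so updated rows
  -- have room to stay on the same page and qualify for HOT
  create table status_b (
      id   integer primary key,
      data text
  ) with (fillfactor = 50);

  -- check whether updates actually take the HOT path
  select relname, n_tup_upd, n_tup_hot_upd
    from pg_stat_all_tables
   where relname = 'status_b';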

regards
Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance