Re: [PERFORM] Prefetch

2005-05-16 Thread Joel Fradkin
My only comment is to ask about the layout of your data (just one table with
indexes?).  I found that on my data, with dozens of joins, my view's speed was
not good enough for me to use, so I made a flat file with no joins and it
flies.
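
For illustration only: if the "flat file" here means a denormalized table
inside the database, the idea is roughly this (table and column names are
hypothetical):

    -- Denormalize the joined view into one wide table that can be scanned
    -- with no joins at all; rebuild or refresh it on whatever schedule the
    -- data allows.
    CREATE TABLE order_flat AS
    SELECT o.order_id,
           o.order_date,
           c.customer_name,
           p.product_name,
           o.quantity,
           o.quantity * p.unit_price AS line_total
      FROM orders o
      JOIN customers c ON c.customer_id = o.customer_id
      JOIN products  p ON p.product_id  = o.product_id;

    CREATE INDEX order_flat_date_idx ON order_flat (order_date);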

Joel Fradkin
 
Wazagua, Inc.
2520 Trailmate Dr
Sarasota, Florida 34243
Tel.  941-753-7111 ext 305
 
[EMAIL PROTECTED]
www.wazagua.com
Powered by Wazagua
Providing you with the latest Web-based technology & advanced tools.
© 2004 WAZAGUA, Inc. All rights reserved.

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Matt Olson
Sent: Monday, May 09, 2005 9:10 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Prefetch

I wanted to get some opinions about row prefetching.  AFAIK, there is no 
prefetching done by PostgreSQL; all prefetching is delegated to the
operating 
system.  

The hardware (can't say enough good things about it):

Athlon 64, dual channel
4GB ram
240GB usable 4 disk raid5 (ATA133)
Fedora Core 3
PostgreSQL 7.4.7

I have what is essentially a data warehouse of stock data.  Each day has 
around 30,000 records (tickers).  A typical operation is to get the 200 day 
simple moving average (of price) for each ticker and write the result to a 
summary table.  In running this process (Perl/DBI), it is typical to see 
70-80% I/O wait time with postgres running at about 8-9%.  If I run the next
day's date, the postgres cache and file cache are now populated with 199 days
of the needed data, postgres runs at 80-90% CPU, and total run time is greatly
reduced.  My conclusion is that this is a high cache hit rate in action.

I've done other things that make sense, like using indexes, playing with the
planner constants and turning up the postgres cache buffers.

Even playing with extreme hdparm read-ahead numbers (i.e. 64738), there is no
apparent difference in database performance.  The random nature of the I/O
drops disk reads down to about 1MB/sec for the array.  A linear table scan
can easily yield 70-80MB/sec on this system.  Total table size is usually
around 1GB and with indexes should be able to fit completely in main memory.

Other databases like Oracle and DB2 implement some sort of row prefetch.  Has
there been serious consideration of implementing something like a prefetch
subsystem?  Does anyone have any opinions as to why this would be a bad idea
for postgres?

Postgres is great for a multiuser environment and OLTP applications.  However,
in this setup, a data warehouse, the observed performance is not what I
would hope for.

Regards,

Matt Olson
Ocean Consulting
http://www.oceanconsulting.com/


Re: [PERFORM] Prefetch

2005-05-11 Thread Rod Taylor
On Wed, 2005-05-11 at 12:53 +0800, Christopher Kings-Lynne wrote:
> > Another trick you can use with large data sets like this when you want
> > results back in seconds is to have regularly updated tables that
> > aggregate the data along each column normally aggregated against the
> > main data set.
> >
> > Maybe some bright person will prove me wrong by posting some working
> > information about how to get these apparently absent features working.
>
> Most people just use simple triggers to maintain aggregate summary tables...

Agreed. I've also got a view which calls a function that will 1) use the
summary table where data exists, or 2) calculate the summary information,
load it into the summary table, and send a copy to the client (a partial
query-results cache).

It's not all nicely abstracted behind user-friendly syntax, but most of
those features can be cobbled together (with effort) in PostgreSQL.
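
A rough sketch of that idea (table and column names are hypothetical, this is
not Rod's actual code, and the dollar-quoting is 8.0-style; on 7.4 the
function body would be an ordinary quoted string):

    -- Cache table: one summary row per (ticker, as_of_date).
    CREATE TABLE sma_cache (
        ticker     text,
        as_of_date date,
        sma_200    numeric,
        PRIMARY KEY (ticker, as_of_date)
    );

    CREATE OR REPLACE FUNCTION get_sma_200(text, date) RETURNS numeric AS $$
    DECLARE
        p_ticker ALIAS FOR $1;
        p_date   ALIAS FOR $2;
        v_result numeric;
    BEGIN
        -- 1) use the summary table where the data already exists
        SELECT sma_200 INTO v_result
          FROM sma_cache
         WHERE ticker = p_ticker AND as_of_date = p_date;
        IF FOUND THEN
            RETURN v_result;
        END IF;

        -- 2) otherwise calculate it, load it into the summary table,
        --    and send a copy back to the client
        SELECT avg(close_price) INTO v_result
          FROM daily_price
         WHERE ticker = p_ticker
           AND trade_date <= p_date
           AND trade_date > p_date - 200;  -- calendar-day window, for brevity

        INSERT INTO sma_cache (ticker, as_of_date, sma_200)
        VALUES (p_ticker, p_date, v_result);

        RETURN v_result;
    END;
    $$ LANGUAGE plpgsql;
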
-- 




Re: [PERFORM] Prefetch

2005-05-11 Thread Bricklen Anderson
Christopher Kings-Lynne wrote:
> > Another trick you can use with large data sets like this when you want
> > results back in seconds is to have regularly updated tables that
> > aggregate the data along each column normally aggregated against the
> > main data set.
> >
> > Maybe some bright person will prove me wrong by posting some working
> > information about how to get these apparently absent features working.
>
> Most people just use simple triggers to maintain aggregate summary
> tables...
>
> Chris
However, if (insert) triggers prove to be too much of a performance hit, try 
cron'd functions that perform the aggregation for you. This system works well 
for us, using the pk's (sequence) for start and stop points.
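
A rough sketch of that kind of cron-driven rollup (hypothetical tables; a
serial pk called id on the detail table, plus a one-row bookkeeping table
that remembers how far the last run got):

    -- One-time setup.
    CREATE TABLE agg_progress (last_id bigint NOT NULL);
    INSERT INTO agg_progress VALUES (0);

    -- Run from cron.  Aggregates only the rows added since the last run;
    -- assumes no concurrent inserts while the job runs (e.g. batch loads).
    -- (A real job would also fold new rows into summary rows that already
    -- exist, as in the COPY/UPDATE/INSERT pattern discussed in this thread.)
    BEGIN;

    INSERT INTO daily_summary (ticker, trade_date, total_volume)
    SELECT ticker, trade_date, sum(volume)
      FROM trades
     WHERE id >  (SELECT last_id FROM agg_progress)
       AND id <= (SELECT max(id) FROM trades)
     GROUP BY ticker, trade_date;

    UPDATE agg_progress SET last_id = (SELECT max(id) FROM trades);

    COMMIT;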

--


Re: [PERFORM] Prefetch

2005-05-11 Thread Mischa Sandberg
Quoting Christopher Kings-Lynne [EMAIL PROTECTED]:
> > Another trick you can use with large data sets like this when you want
> > results back in seconds is to have regularly updated tables that
> > aggregate the data along each column normally aggregated against the
> > main data set.
> >
> > Maybe some bright person will prove me wrong by posting some working
> > information about how to get these apparently absent features working.
>
> Most people just use simple triggers to maintain aggregate summary
> tables...

Don't know if this is more appropriate to bizgres, but:
What the first poster is talking about is what OLAP cubes do.

For big aggregating systems (OLAP), triggers perform poorly, 
compared to messy hand-rolled code. You may have dozens
of aggregates at various levels. Consider the effect of having 
each detail row cascade into twenty updates. 
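
For reference, the kind of simple per-row trigger being discussed looks
roughly like this (hypothetical names; dollar-quoting is 8.0-style):

    CREATE OR REPLACE FUNCTION bump_daily_summary() RETURNS trigger AS $$
    BEGIN
        UPDATE daily_summary
           SET total_volume = total_volume + NEW.volume
         WHERE ticker = NEW.ticker
           AND trade_date = NEW.trade_date;
        IF NOT FOUND THEN
            INSERT INTO daily_summary (ticker, trade_date, total_volume)
            VALUES (NEW.ticker, NEW.trade_date, NEW.volume);
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER trades_summary_trig
        AFTER INSERT ON trades
        FOR EACH ROW EXECUTE PROCEDURE bump_daily_summary();

Multiply this by a dozen aggregates at various levels and the per-row cost
adds up quickly.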

It's particularly silly-looking when data is coming in as 
batches of thousands of rows in a single insert, e.g.

   COPY temp_table FROM STDIN;
   UPDATE fact_table ... FROM ... temp_table
   INSERT INTO fact_table ...FROM...temp_table

   (the above pair of operations is so common, 
Oracle added its MERGE operator for it).
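
Spelled out a little more concretely (column names are hypothetical), the
pattern looks something like:

    -- Stage the incoming batch.
    CREATE TEMP TABLE batch (ticker text, trade_date date, volume bigint);
    COPY batch FROM STDIN;

    -- Fold rows that already exist in the fact table...
    UPDATE fact_table
       SET volume = fact_table.volume + b.volume
      FROM batch b
     WHERE fact_table.ticker = b.ticker
       AND fact_table.trade_date = b.trade_date;

    -- ...and append the ones that don't.
    INSERT INTO fact_table (ticker, trade_date, volume)
    SELECT b.ticker, b.trade_date, b.volume
      FROM batch b
     WHERE NOT EXISTS (SELECT 1
                         FROM fact_table f
                        WHERE f.ticker = b.ticker
                          AND f.trade_date = b.trade_date);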

Hence my recent post (request) for using RULES to aggregate 
--- given no luck with triggers FOR EACH STATEMENT.




[PERFORM] Prefetch

2005-05-10 Thread Matt Olson
I wanted to get some opinions about row prefetching.  AFAIK, there is no
prefetching done by PostgreSQL; all prefetching is delegated to the operating
system.

The hardware (can't say enough good things about it):

Athlon 64, dual channel
4GB ram
240GB usable 4 disk raid5 (ATA133)
Fedora Core 3
PostgreSQL 7.4.7

I have what is essentially a data warehouse of stock data.  Each day has
around 30,000 records (tickers).  A typical operation is to get the 200 day
simple moving average (of price) for each ticker and write the result to a
summary table.  In running this process (Perl/DBI), it is typical to see
70-80% I/O wait time with postgres running at about 8-9%.  If I run the next
day's date, the postgres cache and file cache are now populated with 199 days
of the needed data, postgres runs at 80-90% CPU, and total run time is greatly
reduced.  My conclusion is that this is a high cache hit rate in action.
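
For illustration, each per-ticker query is roughly of this shape (the actual
schema and code were not posted, so the names here are hypothetical):

    -- Run once per ticker, ~30,000 times per day, from Perl/DBI:
    SELECT avg(close_price)
      FROM (SELECT close_price
              FROM daily_price
             WHERE ticker = 'ABC'
               AND trade_date <= '2005-05-09'
             ORDER BY trade_date DESC
             LIMIT 200) AS last_200_days;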

I've done other things that make sense, like using indexes, playing with the
planner constants and turning up the postgres cache buffers.

Even playing with extreme hdparm read-ahead numbers (i.e. 64738) yields no
apparent difference in database performance.  The random nature of the I/O
drops disk reads down to about 1MB/sec for the array.  A linear table scan
can easily yield 70-80MB/sec on this system.  Total table size is usually
around 1GB and with indexes should be able to fit completely in main memory.

Other databases like Oracle and DB2 implement some sort of row prefetch.  Has
there been serious consideration of implementing something like a prefetch
subsystem?  Does anyone have any opinions as to why this would be a bad idea
for postgres?

Postgres is great for a multiuser environment and OLTP applications.  However,
in this setup, a data warehouse, the observed performance is not what I
would hope for.

Regards,

Matt Olson
Ocean Consulting
http://www.oceanconsulting.com/




Re: [PERFORM] Prefetch

2005-05-10 Thread Tom Lane
Matt Olson [EMAIL PROTECTED] writes:
> Other databases like Oracle and DB2 implement some sort of row prefetch.  Has
> there been serious consideration of implementing something like a prefetch
> subsystem?

No.

> Does anyone have any opinions as to why this would be a bad idea for
> postgres?

We know even less than the OS does about disk layout, and not a lot more
than it about what our next request will be.  (If we're doing a seqscan,
then of course that's not true, but I would expect the OS to be able to
figure that one out and do readahead.)

You haven't shown us your problem queries, but I think that conventional
query tuning would be a more appropriate answer.  In particular I wonder
whether you shouldn't be looking at ways to calculate multiple
aggregates in parallel.
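
For instance (hypothetical schema), all of the averages can come out of a
single pass over the table instead of one indexed probe per ticker:

    -- One sequential-scan-friendly pass instead of ~30,000 per-ticker
    -- queries.  (A calendar-day window is used here for brevity; real code
    -- would bound it to the last 200 trading days.)
    SELECT ticker, avg(close_price) AS sma_200
      FROM daily_price
     WHERE trade_date >  date '2005-05-09' - 280
       AND trade_date <= date '2005-05-09'
     GROUP BY ticker;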

regards, tom lane



Re: [PERFORM] Prefetch

2005-05-10 Thread Rod Taylor
> I've done other things that make sense, like using indexes, playing with the
> planner constants and turning up the postgres cache buffers.

After you load the new days data try running CLUSTER on the structure
using a key of (stockID, date) -- probably your primary key.

This should significantly reduce the amount of IO required for your
calculations involving a few stocks over a period of time.
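
For example (index and table names are hypothetical; on the 7.4 server
discussed here the syntax is CLUSTER index ON table, while newer releases
spell it CLUSTER table USING index):

    -- Rewrite the heap in (ticker, date) order so each ticker's history is
    -- physically contiguous, then refresh the statistics.
    CREATE INDEX daily_price_ticker_date_idx
        ON daily_price (ticker, trade_date);
    CLUSTER daily_price_ticker_date_idx ON daily_price;
    ANALYZE daily_price;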

-- 




Re: [PERFORM] Prefetch

2005-05-10 Thread Greg Stark

Matt Olson [EMAIL PROTECTED] writes:

> I've done other things that make sense, like using indexes, playing with the
> planner constants and turning up the postgres cache buffers.
>
> Even playing with extreme hdparm read-ahead numbers (i.e. 64738) yields no
> apparent difference in database performance.  The random nature of the I/O
> drops disk reads down to about 1MB/sec for the array.  A linear table scan
> can easily yield 70-80MB/sec on this system.  Total table size is usually
> around 1GB and with indexes should be able to fit completely in main memory.

Actually forcing things to use indexes is the wrong direction to go if you're
trying to process lots of data and want to stream it off disk as rapidly as
possible. I would think about whether you can structure your data such that
you can use sequential scans. That might mean partitioning your raw data into
separate tables and then accessing only the partitions that are relevant to
the query.

In your application that might be hard. It sounds like you would need more or
less one table per stock ticker which would really be hard to manage.
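
A by-hand sketch of that idea which avoids one-table-per-ticker by splitting
on time instead (names are hypothetical; 7.4 has no built-in partitioning, so
the routing is manual):

    -- Raw data split by month; each month loads and scans independently.
    CREATE TABLE daily_price_2005_04 (ticker text, trade_date date,
                                      close_price numeric);
    CREATE TABLE daily_price_2005_05 (ticker text, trade_date date,
                                      close_price numeric);

    -- A view for the queries that really do need everything...
    CREATE VIEW daily_price_all AS
        SELECT * FROM daily_price_2005_04
        UNION ALL
        SELECT * FROM daily_price_2005_05;

    -- ...while the daily batch touches only the partitions it needs,
    -- turning the work into fast sequential scans.
    SELECT ticker, avg(close_price) AS avg_close
      FROM daily_price_2005_05
     GROUP BY ticker;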

One thing you might look into is using the CLUSTER command. But postgres
doesn't maintain the cluster ordering so it would require periodically
rerunning it.

I'm a bit surprised by your 1MB/s rate. I would expect to see about 10MB/s
even for completely random reads. Is it possible you're seeing something else
interfering? Do you have INSERT/UPDATE/DELETE transactions happening
concurrently with this select scan? If so you should strongly look into
separating the transaction log from the data files.

-- 
greg




Re: [PERFORM] Prefetch

2005-05-10 Thread Matt Olson
My postgres binaries and WAL are on a separate disk from the raid array.  The 
table I'm doing the selects from is probably about 4GB in size and 18-20 
million records.  No concurrent or dependent inserts or deletes are going on.

Tom's point and your points about optimizing the application are well taken.
I know my approach is suboptimal and prone to getting caught by latency
issues (seek times, cache hit rates, etc.).  However, the question of
prefetch in my mind is all about eliminating latencies, so I thought my
problem would be good for the sake of discussing prefetching.

The two approaches I'm in the process of testing are Rod's and Greg's
suggestion of using 'CLUSTER' and, for the sake of not letting a good idea
get away, the parallel query approach Tom suggested.

I'll report back to the list what I find and maybe do some _rough_ 
benchmarking.  This is a production app, so I can't get too much in the way 
of the daily batches.  

-- 
Matt Olson
Ocean Consulting
http://www.oceanconsulting.com/

On Tuesday 10 May 2005 11:13 am, Greg Stark wrote:
> Matt Olson writes:
> > I've done other things that make sense, like using indexes, playing with
> > the planner constants and turning up the postgres cache buffers.
> >
> > Even playing with extreme hdparm read-ahead numbers (i.e. 64738) yields
> > no apparent difference in database performance.  The random nature of the
> > I/O drops disk reads down to about 1MB/sec for the array.  A linear table
> > scan can easily yield 70-80MB/sec on this system.  Total table size is
> > usually around 1GB and with indexes should be able to fit completely in
> > main memory.
>
> Actually forcing things to use indexes is the wrong direction to go if
> you're trying to process lots of data and want to stream it off disk as
> rapidly as possible. I would think about whether you can structure your
> data such that you can use sequential scans. That might mean partitioning
> your raw data into separate tables and then accessing only the partitions
> that are relevant to the query.
>
> In your application that might be hard. It sounds like you would need more
> or less one table per stock ticker which would really be hard to manage.
>
> One thing you might look into is using the CLUSTER command. But postgres
> doesn't maintain the cluster ordering so it would require periodically
> rerunning it.
>
> I'm a bit surprised by your 1MB/s rate. I would expect to see about 10MB/s
> even for completely random reads. Is it possible you're seeing something
> else interfering? Do you have INSERT/UPDATE/DELETE transactions happening
> concurrently with this select scan? If so you should strongly look into
> separating the transaction log from the data files.





Re: [PERFORM] Prefetch

2005-05-10 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
> Actually forcing things to use indexes is the wrong direction to go if
> you're trying to process lots of data and want to stream it off disk as
> rapidly as possible. I would think about whether you can structure your
> data such that you can use sequential scans.

Agreed.

> In your application that might be hard. It sounds like you would need more
> or less one table per stock ticker which would really be hard to manage.

Actually, in a previous lifetime I used to do pretty much the same stuff
Matt is working on.  The reason I suggested parallelizing is that what
you want is usually not so much the 200day moving average of FOO, as the
200day moving averages of a whole bunch of things.  If your input table
contains time-ordered data for all those things, then a seqscan works
out pretty well.

> One thing you might look into is using the CLUSTER command. But postgres
> doesn't maintain the cluster ordering so it would require periodically
> rerunning it.

If the desired sort order is time-based, it falls out pretty much for
free in this application, because historical data doesn't change -- you
are only interested in appending at the right.

In said previous lifetime, we used Postgres for tracking our actual
transactions, but we built a custom file format for storing the
individual tick data.  That's not stuff you need transactional semantics
for; the historical data is what it is.  Besides, you need to compress
it as much as you can because there's always too much of it.  Machines
are faster and disk space cheaper than they were at the time, but I'd
still question the wisdom of using a Postgres row for each daily bar,
let alone finer-grain data.

regards, tom lane



[PERFORM] Prefetch - OffTopic

2005-05-10 Thread Mohan, Ross
for time-series and insane fast, nothing beats kdB, I believe

www.kx.com

Not trying to Quisling-out PG here, just hoping to respond to Mr. Olson





-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane
Sent: Tuesday, May 10, 2005 2:54 PM
To: Greg Stark
Cc: [EMAIL PROTECTED]; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Prefetch 


Greg Stark [EMAIL PROTECTED] writes:
 Actually forcing things to use indexes is the wrong direction to go if 
 you're trying to process lots of data and want to stream it off disk 
 as rapidly as possible. I would think about whether you can structure 
 your data such that you can use sequential scans.

Agreed.

 In your application that might be hard. It sounds like you would need 
 more or less one table per stock ticker which would really be hard to 
 manage.

Actually, in a previous lifetime I used to do pretty much the same stuff Matt 
is working on.  The reason I suggested parallelizing is that what you want is 
usually not so much the 200day moving average of FOO, as the 200day moving 
averages of a whole bunch of things.  If your input table contains time-ordered 
data for all those things, then a seqscan works out pretty well.

 One thing you might look into is using the CLUSTER command. But 
 postgres doesn't maintain the cluster ordering so it would require 
 periodically rerunning it.

If the desired sort order is time-based, it falls out pretty much for free in 
this application, because historical data doesn't change -- you are only 
interested in appending at the right.

In said previous lifetime, we used Postgres for tracking our actual 
transactions, but we built a custom file format for storing the individual tick 
data.  That's not stuff you need transactional semantics for; the historical 
data is what it is.  Besides, you need to compress it as much as you can 
because there's always too much of it.  Machines are faster and disk space 
cheaper than they were at the time, but I'd still question the wisdom of using 
a Postgres row for each daily bar, let alone finer-grain data.

regards, tom lane



Re: [PERFORM] Prefetch - OffTopic

2005-05-10 Thread Chris Browne
[EMAIL PROTECTED] (Mohan, Ross) writes:
> for time-series and insane fast, nothing beats kdB, I believe
>
> www.kx.com

... Which is well and fine if you're prepared to require that all of
the staff that interact with data are skilled APL hackers.  Skilled
enough that they're all ready to leap into Whitney's ASCII-based
variant, K.
-- 
(format nil [EMAIL PROTECTED] cbbrowne acm.org)
http://www.ntlug.org/~cbbrowne/functional.html
Rules of the Evil Overlord #78.  I will not tell my Legions of Terror
And he must  be taken alive! The command will be:  ``And try to take
him alive if it is reasonably practical.''
http://www.eviloverlord.com/



Re: [PERFORM] Prefetch - OffTopic

2005-05-10 Thread Mohan, Ross
Yes, that would be a sufficient (although not necessary) condition for being
well and fine with kdB.
Last time I used APL was... pre-Gregorian, so yeah, that's scary to me, too.

(Of course, one can use C/ODBC or Java/JDBC to reach kdB; once there, one
uses SQL92, or proprietary kSQL.)



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Chris Browne
Sent: Tuesday, May 10, 2005 4:14 PM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Prefetch - OffTopic


[EMAIL PROTECTED] (Mohan, Ross) writes:
 for time-series and insane fast, nothing beats kdB, I believe

 www.kx.com

... Which is well and fine if you're prepared to require that all of the staff 
that interact with data are skilled APL hackers.  Skilled enough that they're 
all ready to leap into Whitney's ASCII-based variant, K.
-- 
(format nil [EMAIL PROTECTED] cbbrowne acm.org) 
http://www.ntlug.org/~cbbrowne/functional.html
Rules of the Evil Overlord #78.  I will not tell my Legions of Terror And he 
must  be taken alive! The command will be:  ``And try to take him alive if it 
is reasonably practical.'' http://www.eviloverlord.com/



Re: [PERFORM] Prefetch

2005-05-10 Thread Sam Vilain
Matt Olson wrote:
> Other databases like Oracle and DB2 implement some sort of row prefetch.
> Has there been serious consideration of implementing something like a
> prefetch subsystem?  Does anyone have any opinions as to why this would be
> a bad idea for postgres?
>
> Postgres is great for a multiuser environment and OLTP applications.
> However, in this setup, a data warehouse, the observed performance is not
> what I would hope for.
Oracle doesn't pre-fetch data to get its fast results in this case;
pre-fetching doesn't give you the 100-times speed increases.

Bitmap indexes are very important for data mining.  You might want to see

   http://www.it.iitb.ac.in/~rvijay/dbms/proj/

I have no idea how well developed this is, but this is often the biggest win
with data warehousing.  If it works, you'll get results back in seconds; if it
doesn't, you'll have plenty of time while your queries run to reflect on the
possibility that commercial databases might actually have important features
that haven't even penetrated the awareness of most free database developers.

Another trick you can use with large data sets like this when you want results
back in seconds is to have regularly updated tables that aggregate the data
along each column normally aggregated against the main data set.

Of course, Pg doesn't have the nice features that make this just work and make
queries against your data source faster (called OLAP query rewrite in Oracle),
so you'll have to put a lot of work into changing your application to figure
out when to use the summary tables.  As far as I know it doesn't have
materialized views, either, so updating these summary tables is also a more
complex task than just a single REFRESH command.
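
For what it's worth, a bare-bones version of such a summary table and its
refresh might look like this (hypothetical names; there is no materialized
view, just a table you rebuild or update yourself):

    -- Summary of the detail data along one commonly-aggregated column.
    CREATE TABLE price_monthly_summary AS
    SELECT ticker,
           date_trunc('month', trade_date) AS month,
           avg(close_price) AS avg_close,
           sum(volume)      AS total_volume
      FROM daily_price
     GROUP BY ticker, date_trunc('month', trade_date);

    -- Crude "REFRESH": rebuild it wholesale (or update it incrementally
    -- from a cron job or triggers, as discussed elsewhere in the thread).
    TRUNCATE price_monthly_summary;
    INSERT INTO price_monthly_summary
    SELECT ticker,
           date_trunc('month', trade_date),
           avg(close_price),
           sum(volume)
      FROM daily_price
     GROUP BY ticker, date_trunc('month', trade_date);
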
Maybe some bright person will prove me wrong by posting some working
information about how to get these apparently absent features working.

You might also want to consider ditching RAID 5 and switching to plain
mirroring.  RAID 5 is a helluva performance penalty (by design).  This is
why they said RAID - fast, cheap, reliable - pick any two.  RAID 5 ain't
fast.  But that's probably not your main problem.

Sam.