Re: [PERFORM] Mostly read performance

2005-08-12 Thread Greg Stark

Michael Stone [EMAIL PROTECTED] writes:

  Well, that's what you'd expect.  But a first time 70MB fetch on a freshly
  rebooted system took just as long as all secondary times.  (Took over a
  minute to fetch, which is too long for my needs, at least on secondary
  attempts).

That's not impressively fast even for the disk. You should get up to about
40Mbit/s or 5MByte/s from the disk (70MB at 5MByte/s is roughly 14 seconds of
raw read time). Add some overhead for postgres, and I would expect a full table
scan of 70MB to take more like 15-30s, not over a minute.

What is your shared_buffers setting? Perhaps you have it set way too high or
way too low?

Also, you probably should post the explain analyze output of the actual
query you're trying to optimize. Even if you're not looking for a better plan,
having hard numbers is better than guessing.
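
For example, something along these lines run in psql (table and column names
here are placeholders for whatever your real query touches):

  EXPLAIN ANALYZE
  SELECT record_id, content
    FROM file_records        -- placeholder table name
   WHERE file_id = 42        -- placeholder predicate
   ORDER BY record_id;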

And the best way to tell if the data is cached is having a vmstat 1 running
in another window. Start the query and look at the bi/bo columns. If you see
bi spike upwards then it's reading from disk.

-- 
greg


Re: [PERFORM] Mostly read performance

2005-08-12 Thread Jeffrey Tenny
(Pardon my replying to two replies at once; I only get the digest and
this was easier.)


Michael Stone wrote:
[...]

Well, that's what you'd expect.  But a first time 70MB fetch on a freshly 
rebooted system took just as long as all secondary times.  (Took over a minute 
to fetch, which is too long for my needs, at least on secondary attempts).



If the query involves a table scan and the data set is larger than your
available memory, you'll need a full scan every time. If you do a table
scan and the table fits in RAM, subsequent runs should be faster. If you
have an index and only need to look at a subset of the table, subsequent
runs should be faster. Without knowing more about your queries it's not
clear what your situation is.


I must amend my original statement.  I'm not using a parameterized 
statement.  The system is effectively fetching file content stored in 
the database for portions of one or more files.  It attempts to batch
the records being fetched into as few non-parameterized queries as 
possible, while balancing the memory impact of the retrieved rowset.


Currently that means it will request up to 16K records in a query that 
is assembled using a combination of IN (recids...), BETWEEN ranges, and
UNION ALL for multiple file IDs.  I do this to minimize the latency of
dbclient/dbserver requests, while at the same time capping the maximum 
data returned by a single DBIO to about 1.2MB per retrieved record set.
(I'm trying not to pound the java app server's memory via JDBC.)
There's an ORDER BY on the file id column too.
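
Roughly, one of those assembled queries looks something like this (table
and column names simplified for illustration):

  SELECT file_id, record_id, content
    FROM file_records
   WHERE file_id = 1
     AND (record_id IN (3, 7, 11) OR record_id BETWEEN 100 AND 250)
  UNION ALL
  SELECT file_id, record_id, content
    FROM file_records
   WHERE file_id = 2
     AND record_id BETWEEN 5000 AND 5300
   ORDER BY file_id;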

It sounds like a simple enough thing to do, but this "pieces of many 
files in a database" problem is actually pretty hard to optimize.
Fetching all records for all files, even though I don't need all 
records, is both inefficient and likely to use too much memory. 
Fetching 1 file at a time is likely to result in too many queries 
(latency overhead).  So right now I err on the side of large but 
record-limited queries.  That lets me process many files in one query, 
unless the pieces of the files I need are substantial.
(I've been burned by trying to use setFetchSize so many times it isn't 
funny; I never count on it any more.)


An index is in place to assist with record selection: a joint index on 
file-id and record-id-within-the-file.  I'll double-check that it's 
actually being used.
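
In other words, something of this form (again with names simplified):

  CREATE INDEX file_records_file_rec_idx
      ON file_records (file_id, record_id);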





Greg Stark wrote:
[...]


What is your shared_buffers setting? Perhaps you have it set way too high or
way too low?


I generally run with the conservative installation default.  I did some 
experimenting with larger values but didn't see any improvement (and 
yes, I restarted the postmaster).  That testing was done a while ago, so 
I no longer have the numbers at hand and can't tell you what they were.
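
For what it's worth, a quick way to confirm what the server is actually
running with is:

  SHOW shared_buffers;   -- on 7.3 this reports a count of 8KB buffers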




Also, you probably should post the explain analyze output of the actual
query you're trying to optimize. Even if you're not looking for a better plan,
having hard numbers is better than guessing.


A good suggestion.  I'll look into it.



And the best way to tell if the data is cached is having a vmstat 1 running
in another window. Start the query and look at the bi/bo columns. If you see
bi spike upwards then it's reading from disk.


Another good suggestion.

I'll look into gathering more data along the lines of the above suggestions.

I'm also looking into getting a gig or two of ram to make sure that 
isn't an issue.


The original point of my post was to confirm that, all things being 
equal, there's no reason the disk I/O done on behalf of the database 
shouldn't be cached by the operating system/file system, so that 
repeated reads can benefit from in-memory data.





Re: [PERFORM] Mostly read performance

2005-08-11 Thread John A Meinel
Jeffrey Tenny wrote:
 I have a largely table-append-only application where most transactions
 are read-intensive and many are read-only.  The transactions may span
 many tables, and in some cases might need to pull 70 MB of data out of a
 couple of the larger tables.


 In 7.3, I don't seem to see any file system or other caching that helps
 with repeated reads of the 70MB of data.  Secondary fetches are pretty
 much as slow as the first fetch. (The 70MB in this example might take
 place via 2000 calls to a parameterized statement via JDBC).

 Were there changes after 7.3 w.r.t. caching of data? I read this list
 and see people saying that 8.0 will use the native file system cache to
 good effect.  Is this true? Is it supposed to work with 7.3?  Is there
 something I need to do to get postgresql to take advantage of large ram
 systems?

 Thanks for any advice.


Well, first off, the general recommendation is probably that 7.3 is
really old, and you should try to upgrade to at least 7.4, though
preferably to 8.0.

The bigger questions: How much RAM do you have? How busy is your system?

8.0 doesn't really do anything to make the system cache the data.
What kernel are you using?

Also, if your tables are small enough, and your RAM is big enough, you
might already have everything cached.

One way to flush the caches is to allocate a bunch of memory and then
scan through it.  Or maybe mmap a really big file and access every byte.
But if your kernel is smart enough, it could certainly deallocate pages
after you stopped accessing them, so I can't say for sure that you can
flush the memory cache.  Usually, I believe these methods are sufficient.

John
=:-




Re: [PERFORM] Mostly read performance

2005-08-11 Thread Jeffrey Tenny

John A Meinel wrote:
Well, first off, the general recommendation is probably that 7.3 is
really old, and you should try to upgrade to at least 7.4, though
preferably to 8.0.


There have been issues with each release that led me to wait.
Even now I'm waiting for some things to settle in the 8.0 JDBC driver
(timezones), and 7.3 has behaved well for me.  But yes, I'd like to upgrade.



The bigger questions: How much RAM do you have? How busy is your system?


The system for testing was 512MB.  I'm in the process of buying some 
additional memory.  However, there was no swap activity on that system, 
so I doubt memory was the limiting factor.




8.0 doesn't really do anything to do make the system cache the data.
What kernel are you using?


2.4.X for various large x.  (Multiple systems).  Gonna try 2.6.x soon.



Also, if your tables are small enough, and your RAM is big enough, you
might already have everything cached.


Well, that's what you'd expect.  But a first time 70MB fetch on a 
freshly rebooted system took just as long as all secondary times.  (Took 
over a minute to fetch, which is too long for my needs, at least on 
secondary attempts).



One way to flush the caches, is to allocate a bunch of memory, and then
scan through it. Or maybe mmap a really big file, and access every byte.
But if your kernel is smart enough, it could certainly deallocate pages
after you stopped accessing them, so I can't say for sure that you can
flush the memory cache. Usually, I believe these methods are sufficient.


Not sure how that would really help.  It doesn't seem like the database 
or file system is caching the table content either way, which led me to 
this inquiry.




Re: [PERFORM] Mostly read performance

2005-08-11 Thread Michael Stone

On Thu, Aug 11, 2005 at 07:13:27PM -0400, Jeffrey Tenny wrote:

The system for testing was 512MB


That's definitely *not* a large ram system. If you're reading a subset
of data that totals 70MB, I'm going to guess that your data set is larger
than, or at least a large fraction of, 512MB.

additional memory.  However, there was no swap activity on that system, 
so I doubt memory was the limiting factor.


The system won't swap if your data set is larger than your memory; it
just won't cache the data.

Well, that's what you'd expect.  But a first time 70MB fetch on a 
freshly rebooted system took just as long as all secondary times.  (Took 
over a minute to fetch, which is too long for my needs, at least on 
secondary attempts).


If the query involves a table scan and the data set is larger than your
available memory, you'll need a full scan every time. If you do a table
scan and the table fits in RAM, subsequent runs should be faster. If you
have an index and only need to look at a subset of the table, subsequent
runs should be faster. Without knowing more about your queries it's not
clear what your situation is.

Mike Stone
