Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-05-01 Thread Josh Berkus
Greg,

 1) Collect up data about their system (memory, disk layout), find out a
 bit about their apps/workload, and generate a config file based on that.

We could start with this.  Where I bogged down is that collecting system 
information across several different operating systems ... and in some cases 
generating scripts for boosting things like shmmax ... is actually quite a 
large problem from a slog perspective; there is no standard way even within 
Linux to describe CPUs, for example.  Collecting available disk space 
information is even worse.  So I'd like some help on this portion.

I actually have algorithms worked out which are good enough to start with for 
most of the important GUCs, and others could be set through an interactive 
script ("Does your application require large batch loads involving thousands 
or millions of updates in the same transaction?"  "How large (GB) do you 
expect your database to be?")
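
To illustrate, here's a minimal sketch of such an interactive script; the 
questions come from above, but the rules of thumb and the GUCs chosen are 
just placeholders, not the actual algorithms:

# Illustrative only: the thresholds and rules of thumb below are assumptions.

def ask(prompt, cast=str):
    return cast(input(prompt + " ").strip())

ram_gb = ask("How much RAM (GB) does this server have?", float)
db_gb  = ask("How large (GB) do you expect your database to be?", float)
batch  = ask("Large batch loads with millions of updates per transaction? [y/n]",
             str).lower().startswith("y")

settings = {
    # Common rules of thumb (assumed): ~25% of RAM for shared_buffers,
    # ~75% of RAM for effective_cache_size.
    "shared_buffers":       "%dMB" % int(ram_gb * 1024 * 0.25),
    "effective_cache_size": "%dMB" % int(ram_gb * 1024 * 0.75),
    # Big batch transactions want more WAL headroom (pre-9.5 GUC).
    "checkpoint_segments":  64 if batch else 16,
    # If the whole database fits in RAM, random I/O is relatively cheap.
    "random_page_cost":     2.0 if db_gb < ram_gb else 4.0,
}

for guc, value in settings.items():
    print("%s = %s" % (guc, value))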

 2) Connect to the database and look around.  Study the tables and some of
 their stats, make some estimates based on what you find, produce a new
 config file.

I'm not sure much more can be done for (2) than for (1).  Tables on disk 
don't tell us much by themselves.  

 3) Monitor the database while it's doing its thing.  See which parts go
 well and which go badly by viewing database statistics like pg_statio.
 From that, figure out where the bottlenecks are likely to be and push more
 resources toward them.  What I've been working on lately is exposing more
 readouts of performance-related database internals to make this more
 practical.

We really should collaborate on that.  

 When first exposed to this problem, most people assume that (1) is good
 enough--ask some questions, look at the machine, and magically a
 reasonable starting configuration can be produced.  It's already been
 pointed out that anyone with enough knowledge to do all that can probably
 spit out a reasonable guess for the config file without help.  

But that's actually more than most people do now.  Further, if you don't 
start with a reasonable configuration, then it's difficult or impossible to 
analyze where your settings are out of whack; behavior introduced by some 
way-too-low settings will mask any other tuning that needs to be done.  It's 
also hard/impossible to devise tuning algorithms that work for both gross 
tuning (increase shared_buffers by 100x) and fine tuning (decrease 
bgwriter_interval to 45ms).

So whether or not we do (3), we need to do (1) first.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [PERFORM] sytem log audit/reporting and psql

2007-05-01 Thread Andreas Haumer

Hi!

Fei Liu wrote:
[...]
 
 Now here are my questions:

That's a lot of questions, and some of them are not related
to pgsql-performance or even to PostgreSQL.

I'll try to answer some of them, because I'm currently experimenting
with partitioned tables, too.


 2) Is partitioning a good approach to speed up log query/view? The user
 comments on partitioning in the PostgreSQL manual seem to indicate that
 partitioning may be slower than a non-partitioned table under certain
 circumstances.

You can look at table partitioning under several points of view.
My experience with table partitioning under the aspect of performance is:

*) It is a benefit for large tables only, and the definition of
   "large" depends on your data.
   I did some testing for an application we are developing, and there
   it turned out that table partitioning does not seem to make sense for
   tables with fewer than 10 million rows (perhaps even more).

*) The performance benefit depends on your queries. Some queries get
   a big improvement, but some queries might even run significantly
   slower.

*) Depending on the way you set up your system, inserts can be much
   slower with partitioned tables (e.g. if you are using triggers
   to automatically create your partitions on demand).
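
   One way around the trigger overhead is to pre-create partitions from a
   scheduled job instead of creating them on demand.  A minimal sketch,
   assuming psycopg2 and a parent table "log" partitioned on a "log_time"
   column via inheritance (all names are made up for illustration):

# Pre-create tomorrow's daily partition from cron, so the INSERT path never
# has to create tables on demand.  Table and column names are hypothetical.
import datetime
import psycopg2

def ensure_partition(conn, day):
    name = "log_%s" % day.strftime("%Y%m%d")
    lo = day.isoformat()
    hi = (day + datetime.timedelta(days=1)).isoformat()
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS %s ("
        "  CHECK (log_time >= '%s' AND log_time < '%s')"
        ") INHERITS (log)" % (name, lo, hi))
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=test")      # assumed connection string
    ensure_partition(conn, datetime.date.today() + datetime.timedelta(days=1))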

 3) How to avoid repetitive log entry scanning since my cron job script
 is run daily but logrotate runs weekly? This means my script will parse
 duplicate entries every time.

This has nothing to do with postgres, but I wrote something similar years
ago. Here's what I did and what you could do:

Remember the last line of your logfile in some external file (or even the
database). Then on the next run you can read the logfile again line by line
and skip all lines until you have found the line you saved on the last run.
* If you find the line that way, just start parsing the logfile beginning
  at the next line.
* If you cannot find your line and you reach EOF, start parsing again at
  the beginning of the logfile.
* If this is your first run and you don't have a line stored yet, start
  parsing at the beginning of the logfile.

When you are finished, remember the last line of the logfile again in the
same place.
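
A minimal sketch of that approach (the state file path and the per-line
handler are placeholders, not from your setup):

import os

STATE_FILE = "/var/tmp/logparse.state"   # hypothetical location

def parse_entry(line):
    # placeholder for whatever the daily cron job does with one log line
    print(line)

def parse_new_entries(logfile):
    last = None
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            last = f.read().rstrip("\n") or None

    with open(logfile) as f:
        lines = f.read().splitlines()

    start = 0
    if last is not None and last in lines:
        # Found the line we stopped at last time: resume just after it.
        start = lines.index(last) + 1
    # Otherwise (first run, or the file was rotated): parse from the top.

    for line in lines[start:]:
        parse_entry(line)

    if lines:
        with open(STATE_FILE, "w") as f:
            f.write(lines[-1] + "\n")

if __name__ == "__main__":
    parse_new_entries("/var/log/syslog")  # hypothetical logfile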


 6) What are the best approaches to analyzing PostgreSQL query performance,
 and how can it be improved?

Here are some general recommendations for performance testing from
my experience:

*) Test with real data. For table partitioning this means you have
   to create really large datasets to make your tests useful.
   You should write a small program to generate your test data if
   possible or use some other means to create your test database.
   You also need time: creating a test database and importing 100
   million rows of test data will take several hours or even days!

*) Test with the real queries from your application!
   Testing with just a few easy standard queries will almost certainly
   not be sufficient to get the right numbers for the performance you
   will see in your application later on!
   Look at the thread "Query performance problems with partitioned tables"
   I started on pgsql-performance just yesterday to see what I mean!

*) Use EXPLAIN ANALYZE and look at the cost and actual time numbers
   this gives you. It also will show you the query plan used by PostgreSQL
   when executing your query. You sometimes might be surprised what is going
   on behind the scenes...

*) If you are just using a stopwatch to time your queries, be aware
   of other factors which might significantly influence your results:
   caching, other jobs running on the machine in parallel, cosmic rays, ...

*) Before running your tests you should always try to get to some
   well-defined starting point (this might even mean rebooting your server
   before running each test), and you should always repeat each test
   several times and then calculate a mean value (and standard deviation,
   to see how good your results are); a small timing sketch follows below.

*) Document your test setup and procedure as well as your results
   (otherwise two days later you won't remember which test obtained
   what result)
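
For example, a minimal timing sketch along those lines (the query, table and
connection string are placeholders I made up, not from any real setup):

import statistics
import time
import psycopg2

QUERY = "SELECT count(*) FROM log WHERE log_time >= now() - interval '1 day'"

conn = psycopg2.connect("dbname=test")    # assumed connection string
cur = conn.cursor()

# One EXPLAIN ANALYZE run to see the plan PostgreSQL actually chooses.
cur.execute("EXPLAIN ANALYZE " + QUERY)
for (plan_line,) in cur.fetchall():
    print(plan_line)

# Then repeat the plain query several times and summarize wall-clock times.
samples = []
for _ in range(5):
    t0 = time.time()
    cur.execute(QUERY)
    cur.fetchall()
    samples.append(time.time() - t0)

print("mean %.3fs  stddev %.3fs" % (statistics.mean(samples),
                                    statistics.stdev(samples)))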

HTH

- andreas

--
Andreas Haumer | mailto:[EMAIL PROTECTED]
*x Software + Systeme  | http://www.xss.co.at/
Karmarschgasse 51/2/20 | Tel: +43-1-6060114-0
A-1100 Vienna, Austria | Fax: +43-1-6060114-71



Re: [PERFORM] index structure for 114-dimension vector

2007-05-01 Thread Andrew Lazarus
Let me just thank the list, especially for the references. (I found
similar papers myself with Google: and to think I have a university
library alumni card and barely need it any more!)

I'll write again on the sorts of results I get.


Re: [PERFORM] index structure for 114-dimension vector

2007-05-01 Thread Alexander Staubo

On 5/1/07, Andrew Lazarus [EMAIL PROTECTED] wrote:

Let me just thank the list, especially for the references. (I found
similar papers myself with Google: and to think I have a university
library alumni card and barely need it any more!)

I'll write again on the sorts of results I get.


Looking forward to hearing about them. I have worked with such dataset
problems, but never attempted to apply them to a relational database
such as PostgreSQL. If you want speed, nothing beats in-memory vectors
on a modern CPU architecture.

Alexander.



Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-05-01 Thread Carlos Moreno


large problem from a slog perspective; there is no standard way even within 
Linux to describe CPUs, for example. Collecting available disk space 
information is even worse.   So I'd like some help on this portion.
  


Quite likely, naivete follows...  But aren't things like /proc/cpuinfo,
/proc/meminfo, /proc/partitions and /proc/diskstats standard, at the very
least across Linux distros?  I'm not familiar with BSD or other Unix
flavours, but I would expect these (or their equivalents) to exist there
too, no?
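
For illustration, a quick Linux-only sketch of what two of those files
expose (Python; the exact field names may vary by kernel):

# Linux-only sketch; field names differ between kernels and architectures.
def proc_kv(path, sep=":"):
    """Parse simple 'key: value' style /proc files into a dict."""
    info = {}
    with open(path) as f:
        for line in f:
            if sep in line:
                key, _, value = line.partition(sep)
                info[key.strip()] = value.strip()
    return info

cpu = proc_kv("/proc/cpuinfo")   # repeated per-CPU blocks collapse to the last
mem = proc_kv("/proc/meminfo")

print("CPU model :", cpu.get("model name"))
print("Total RAM :", mem.get("MemTotal"))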

Am I just being naive?

Carlos
--




Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-05-01 Thread david

On Tue, 1 May 2007, Carlos Moreno wrote:


 large problem from a slog perspective; there is no standard way even
 within Linux to describe CPUs, for example. Collecting available disk
 space information is even worse.   So I'd like some help on this portion.



Quite likely, naiveness follows...  But, aren't things like /proc/cpuinfo ,
/proc/meminfo, /proc/partitions / /proc/diskstats standard, at the very
least across Linux distros?  I'm not familiar with BSD or other Unix
flavours, but I would expect these (or their equivalent) to exist in those,
no?

Am I just being naive?


Unfortunately, yes.

Across different Linux distros they are fairly standard (however, different 
kernel versions will change them).


However, different kernels need drastically different tools to get the info 
out of them.


David Lang



Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-05-01 Thread Greg Smith

On Mon, 30 Apr 2007, Kevin Hunter wrote:

I recognize that PostgreSQL and MySQL try to address different 
problem-areas, but is this one reason why a lot of people with whom I 
talk prefer MySQL? Because PostgreSQL is so slooow out of the box?


It doesn't help, but there are many other differences that are as big or 
bigger.  Here are a few samples off the top of my head:


1) Performance issues due to MVCC (MySQL fans love to point out how 
fast they can do select count(*) from x)
2) Not knowing you have to run vacuum analyze and therefore never seeing a 
good query result
3) Unfair comparisons of PostgreSQL with its robust WAL vs. MySQL+MyISAM on 
write-heavy workloads


These are real issues, which of course stack on top of things like 
outdated opinions formed on older PG releases whose performance issues have 
been resolved in the last few years.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-05-01 Thread Greg Smith

On Tue, 1 May 2007, Josh Berkus wrote:

there is no standard way even within Linux to describe CPUs, for 
example. Collecting available disk space information is even worse.  So 
I'd like some help on this portion.


I'm not fooled--secretly you and your co-workers laugh at how easy this is 
on Solaris and are perfectly happy with how difficult it is on Linux, 
right?


I joke because I've been re-solving some variant of this problem every few 
years for a decade now and it just won't go away.  Last time I checked, the 
right answer was to find someone else who's already done it, packaged that 
into a library, and appears committed to keeping it up to date; just pull 
a new rev of that when you need it.  For example, for the CPU/memory part, 
top solves this problem and is always kept current, so on open-source 
platforms there's the potential to re-use that code.  Now that I know 
that's one thing you're (understandably) fighting with, I'll dig up my 
references on that (again).


It's also hard/impossible to devise tuning algorithms that work for both 
gross tuning (increase shared_buffers by 100x) and fine tuning (decrease 
bgwriter_interval to 45ms).


I would advocate focusing on iterative improvements to an existing 
configuration rather than even bothering with generating a one-off config 
for exactly this reason.  It *is* hard/impossible to get it right in a 
single shot, because of how many parameters interact and the way 
bottlenecks clear, so why not assume from the start you're going to do it 
several times--then you've only got one piece of software to write.


The idea I have in my head is a tool that gathers system info, connects to 
the database, and then spits out recommendations in order of expected 
effectiveness--with the specific caveat that changing too many things at 
one time isn't recommended, and some notion of parameter dependencies. 
The first time you run it, you'd be told that shared_buffers was wildly 
low, effective_cache_size isn't even in the right ballpark, and your 
work_mem looks small relative to the size of your tables; fix those before 
you bother doing anything else because any data collected with those at 
very wrong values is bogus.  Take two, those parameters pass their sanity 
tests, but since you're actually running at a reasonable speed now the 
fact that your tables are no longer being vacuumed frequently enough might 
bubble to the top.
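
To make that concrete, here's a rough sketch of what a first sanity-check
pass might look like; the RAM figure, the thresholds, and the specific checks
are placeholders for the sake of the sketch, not a worked-out design:

import psycopg2

RAM_MB = 8192    # assumed; a real tool would get this from the system probe

conn = psycopg2.connect("dbname=test")   # assumed connection string
cur = conn.cursor()

def setting_mb(name):
    # pg_settings reports memory GUCs in units of 8kB, kB or MB.
    cur.execute("SELECT setting, unit FROM pg_settings WHERE name = %s", (name,))
    value, unit = cur.fetchone()
    scale = {"8kB": 8.0 / 1024, "kB": 1.0 / 1024, "MB": 1.0}.get(unit, 0.0)
    return float(value) * scale

recommendations = []
if setting_mb("shared_buffers") < RAM_MB * 0.1:
    recommendations.append("shared_buffers is wildly low relative to RAM")
if setting_mb("effective_cache_size") < RAM_MB * 0.5:
    recommendations.append("effective_cache_size isn't in the right ballpark")
if setting_mb("work_mem") < 4:
    recommendations.append("work_mem looks small; big sorts will spill to disk")

for i, rec in enumerate(recommendations, 1):
    print("%d. %s" % (i, rec))
if not recommendations:
    print("Memory settings pass their sanity tests; gather run-time stats "
          "(pg_statio_*, vacuum activity) for the next pass.")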


It would take a few passes through to nail down everything, but as long as 
it's put together such that you'd be in a similar position to the 
single-shot tool after running it once it would remove that as something 
separate that needed to be built.


To argue against myself for a second, it may very well be the case that 
writing the simpler tool is the only way to get a useful prototype for 
building the more complicated one; very easy to get bogged down in feature 
creep on a grand design otherwise.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-05-01 Thread david

On Tue, 1 May 2007, Greg Smith wrote:


On Tue, 1 May 2007, Josh Berkus wrote:


 there is no standard way even within Linux to describe CPUs, for example.
 Collecting available disk space information is even worse.  So I'd like
 some help on this portion.


What type of description of the CPUs are you looking for?


 It's also hard/impossible to devise tuning algorithms that work for both
 gross tuning (increase shared_buffers by 100x) and fine tuning (decrease
 bgwriter_interval to 45ms).


I would advocate focusing on iterative improvements to an existing 
configuration rather than even bothering with generating a one-off config for 
exactly this reason.  It *is* hard/impossible to get it right in a single 
shot, because of how many parameters interact and the way bottlenecks clear, 
so why not assume from the start you're going to do it several times--then 
you've only got one piece of software to write.


Nobody is asking for things to be right the first time.

The idea I have in my head is a tool that gathers system info, connects to 
the database, and then spits out recommendations in order of expected 
effectiveness--with the specific caveat that changing too many things at one 
time isn't recommended, and some notion of parameter dependencies. The first 
time you run it, you'd be told that shared_buffers was wildly low, 
effective_cache_size isn't even in the right ballpark, and your work_mem 
looks small relative to the size of your tables; fix those before you bother 
doing anything else because any data collected with those at very wrong 
values is bogus.


Why not have a much simpler script that gets these values up into the 
right ballpark first?  Then after that, the process and analysis you 
are suggesting would be useful.  The problem is that the defaults are _so_ 
far off that no sane incremental program is going to be able to converge 
on the right answer rapidly.


David Lang

 Take two, those parameters pass their sanity tests, but 
since you're actually running at a reasonable speed now the fact that your 
tables are no longer being vacuumed frequently enough might bubble to the 
top.


It would take a few passes through to nail down everything, but as long as 
it's put together such that you'd be in a similar position to the single-shot 
tool after running it once it would remove that as something separate that 
needed to be built.


To argue against myself for a second, it may very well be the case that 
writing the simpler tool is the only way to get a useful prototype for 
building the more complicated one; very easy to get bogged down in feature 
creep on a grand design otherwise.

