Re: [PERFORM] [GENERAL] PostgreSQL - case studies

2010-02-10 Thread Ing. Marcos L. Ortiz Valmaseda

On 10/02/2010 6:49, Scott Marlowe wrote:

Quick note: please stick to plain-text formatted email for the mailing list;
it's the preferred format.

On Tue, Feb 9, 2010 at 9:09 PM, Jayadevan M
jayadevan.maym...@ibsplc.com  wrote:
   

Hello all,
Apologies for the long mail.
I work for a company that provides solutions mostly on a Java/Oracle 
platform. Recently we moved one of our products to PostgreSQL. The main reason 
was PostgreSQL's GIS capabilities and the inability of government departments 
(especially road/traffic) to spend a lot of money for such projects. This 
product is used to record details about accidents and related analysis (type of 
road, when/why etc) with maps. Fortunately, even in India, an accident 
reporting application does not have to handle many tps :).  So, I can't say 
PostgreSQL's performance was really tested in this case.
Later, I load-tested one screen of one of our products with JMeter. 
We tried it with Oracle, DB2, PostgreSQL and Ingres, and PostgreSQL easily 
outperformed the rest. We tried a transaction mix of 20+ SELECTs, an UPDATE, 
a DELETE and a few INSERTs.
 

Please note that benchmarking Oracle (and a few other commercial DBs)
and then publishing those results without Oracle's permission is
considered a breach of their contract.  Yeah, another wonderful
aspect of using Oracle.

That said, and as someone who is not an Oracle licensee in any way,
this matches my experience that PostgreSQL is a match for Oracle, DB2,
and most other databases in the simple, single-DB-on-commodity-hardware
scenario.

   

After a really good experience with the database, I subscribed to all the 
PostgreSQL groups (my previous experience is all-Oracle) and, reading these 
mails, I realized that many organizations are using plain, 'not customized' 
PostgreSQL for databases that handle critical applications.  Since there is no 
company trying to 'sell' PostgreSQL, many of us are not aware of such cases.
 

Actually there are several companies that sell pgsql services, and some
that sell customized versions: Red Hat, Command Prompt, EnterpriseDB,
and so on.

   

Could some of you please share some info on such scenarios - where you are 
supporting/designing/developing databases that run into at least a few hundred 
GBs of data (I know, that is small by today's standards)?
 

There are other instances of folks on the list sharing this kind of
info; you can find them by searching the archives.  I've used pgsql for
about 10 years, for anywhere from a few megabytes to hundreds of
gigabytes, and for all kinds of applications.

Where I currently work we have a main data store for a web app that is
about 180 gigabytes and growing, running on three servers with Slony
replication. We handle somewhere in the range of 10k to 20k queries
per minute (a mix of roughly 90% reads to 10% writes).  Peak load can be
30k or more requests per minute.

The two big servers that handle this load are dual quad-core 2.1GHz
Opteron machines with 32GB of RAM and 16 15krpm SAS drives, configured as
2 in RAID-1 for the OS and pg_xlog, 2 hot spares, and 12 in a RAID-10 for
the main data.  The HW RAID controller is the Areca 1680, which is mostly
stable except for the occasional (once a year or so) hang problem
which has been described, and which Areca has assured me they are
working on.

Our total downtime due to database outages in the last year or so has
been 10 to 20 minutes, and that was due to a RAID card driver bug that
hits us about once every 300 to 400 days.  The majority of the downtime
has been waiting for our hosting provider to hit the big red
switch and restart the main server.

Our other pgsql servers provide the search facility, with a DB size of
around 300GB, and statistics at around 1TB.

   

I am sure PostgreSQL has matured a lot more since the days when these case 
studies were posted. I went through the case studies at EnterpriseDB and 
similar vendors too, but those are customized PostgreSQL servers.
 

Not necessarily.  They sell support more than anything, and the
majority of customization is not for stability but for additional
features, such as MPP queries, replication, etc.

The real issue you run into is that many people don't want to tip
their hand that they are using pgsql because it is a competitive
advantage.  It's inexpensive, capable, and relatively easy to use.  If
your competitor is convinced that Oracle or MSSQL server with $240k in
licensing each year is the best choice, and you're whipping them with
pgsql, the last thing you want is for them to figure that out and
switch.

   
Following up on that subject, there are many applications around the world 
that use PostgreSQL for their business.
We are planning the design and deployment of a large PostgreSQL cluster 
for a DWH/ODS/BI application.
We are documenting everything so that the information can later be 
published in the PostgreSQL case studies section.


We are using Slony-I for replication and PgBouncer for connection pooling.

Re: [PERFORM] renice on an I/O bound box

2010-01-19 Thread Ing. Marcos L. Ortiz Valmaseda

On 19/01/2010 13:59, Willy-Bas Loos wrote:

Hi,

I have a query that runs for about 16 hours; it should run at least 
weekly.
There are also clients connecting via a website, and we don't want to keep 
them waiting because of long DSS queries.


We use Debian Lenny.
I've noticed that renicing the process really lowers the load (in 
top), though I think we are I/O bound. Does that make any sense?


Cheers,

WBL
--
Patriotism is the conviction that your country is superior to all 
others because you were born in it. -- George Bernard Shaw

16 hours? How much data is involved - something on the order of 10 to 30 million records?

Could you post the query here so we can see whether we can help you optimize it?

As for your question, iostat will give you more information about whether you really are I/O bound.
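
If you decide to renice only the offending backend rather than the whole postgres service, you will need its process ID first. A minimal sketch (assuming an 8.x-era server, where pg_stat_activity exposes the backend PID as procpid and idle sessions show up as '<IDLE>'):

-- Find long-running statements and the OS process ID of their backend,
-- which is what renice operates on.
SELECT procpid,
       now() - query_start AS runtime,
       current_query
  FROM pg_stat_activity
 WHERE current_query <> '<IDLE>'
 ORDER BY query_start;

With the PID in hand, renice (or, since the box looks I/O bound, ionice on Linux) can be applied to just that backend, and iostat -x 5 alongside will show whether the disks are actually saturated.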
Regards




Re: [PERFORM] Re: New server to improve performance on our large and busy DB - advice? (v2)

2010-01-15 Thread Ing. Marcos L. Ortiz Valmaseda

On 15/01/2010 14:43, Ivan Voras wrote:

hi,

You wrote a lot of information here so let's confirm in a nutshell 
what you have and what you are looking for:


* A database that is of small to medium size (5 - 10 GB)?
* Around 10 clients that perform constant write operations to the 
database (UPDATE/INSERT)

* Around 10 clients that occasionally read from the database
* Around 6000 tables in your database
* A problem with tuning it all
* Migration to new hardware and/or OS

Is this all correct?

The first thing that is noticeable is that you seem to have way too few 
drives in the server - not because of the disk space required but because 
of speed. You didn't say what type of drives you have or what you would 
consider desirable performance levels, but offhand (because of the 
"10 clients perform constant writes" part) you will probably want at 
least 2x-4x more drives.


 1) Which RAID level would you recommend

With only 4 drives, RAID 10 is the only thing usable here.

 2) Which Windows OS would you recommend? (currently 2008 x64 Server)

Would not recommend Windows OS.

 3) If we were to port to a *NIX flavour, which would you recommend? (which support trouble-free PG builds/makes please!)

Practically any. I'm biased toward FreeBSD, but a nice, supported version 
of Linux will probably be fine.


 4) Is this the right PG version for our needs?

If you are starting from scratch on a new server, go for the newest 
version you can get - 8.4.2 in this case.


Most importantly, you didn't say what you would consider desirable 
performance. The hardware and the setup you described will work, but 
not necessarily fast enough.


 . So far, we have never seen a situation where a seq scan has improved
 performance, which I would attribute to the size of the tables

... and to the small number of drives you are using.

 . We believe our requirements are exceptional, and we would benefit
 immensely from setting up the PG planner to always favour 
index-oriented decisions


Have you tried decreasing random_page_cost in postgresql.conf? Or 
setting (as a last resort) enable_seqscan = off?
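
For example, both can be tried per session first, before changing postgresql.conf (the values below are only illustrative, not recommendations for this particular workload):

-- Make random (index) I/O look cheaper relative to sequential I/O.
-- The default random_page_cost is 4.0; lower values favour index scans.
SET random_page_cost = 2.0;

-- Last resort: heavily penalize sequential scans for this session only,
-- then re-run EXPLAIN ANALYZE on the problem queries to compare plans.
SET enable_seqscan = off;

If the index plans consistently win, lowering random_page_cost in postgresql.conf is the safer permanent change; leaving enable_seqscan off globally is rarely a good idea.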



Carlo Stonebanks wrote:
My client just informed me that new hardware is available for our DB 
server.


. Intel Core 2 Quads Quad
. 48 GB RAM
. 4 Disk RAID drive (RAID level TBD)

I have put the ugly details of what we do with our DB below, as well as the
postgres.conf settings. But, to summarize: we have a PostgreSQL 8.3.6 DB
with very large tables and the server is always busy serving a constant
stream of single-row UPDATEs and INSERTs from parallel automated processes.


There are less than 10 users, as the server is devoted to the KB production
system.

My questions:

1) Which RAID level would you recommend
2) Which Windows OS would you recommend? (currently 2008 x64 Server)
3) If we were to port to a *NIX flavour, which would you recommend? (which support trouble-free PG builds/makes please!)
4) Is this the right PG version for our needs?

Thanks,

Carlo

The details of our use:

. The DB hosts a data warehouse and a knowledgebase (KB) tracking the
professional information of 1.3M individuals.
. The KB tables related to these 130M individuals are naturally also 
large
. The DB is in a perpetual state of serving TCL-scripted Extract, Transform
and Load (ETL) processes
. These ETL processes typically run 10 at-a-time (i.e. in parallel)
. We would like to run more, but the server appears to be the bottleneck
. The ETL write processes are 99% single row UPDATEs or INSERTs.
. There are few, if any DELETEs
. The ETL source data are import tables
. The import tables are permanently kept in the data warehouse so that we
can trace the original source of any information.
. There are 6000+ import tables and counting
. The import tables number from dozens to hundreds of thousands of rows.
They rarely require more than a pkey index.
. Linking the KB to the source import data requires an audit table of 500M
rows, and counting.
. The size of the audit table makes it very difficult to manage, especially
if we need to modify the design.
. Because we query the audit table in different ways to audit the ETL
processes' decisions, almost every column in the audit table is indexed.
. The maximum number of physical users is 10 and these users RARELY perform
any kind of write
. By contrast, the 10+ ETL processes are writing constantly
. We find that internal stats drift, for whatever reason, causing row seq
scans instead of index scans.
. So far, we have never seen a situation where a seq scan has improved
performance, which I would attribute to the size of the tables
. We believe our requirements are exceptional, and we would benefit
immensely from setting up the PG planner to always favour index-oriented
decisions - which seems to contradict everything that PG advice suggests as
best practice.

Current non-default conf settings are:

autovacuum = on
autovacuum_analyze_scale_factor = 0.1
autovacuum_analyze_threshold = 250
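
Regarding the stats drift mentioned above, one option worth trying (a sketch only - it assumes the move to 8.4 suggested earlier, since per-table autovacuum storage parameters were added in 8.4, and the table name here is hypothetical) is to have autovacuum analyze the heavily written tables more aggressively than the global settings above:

-- Hypothetical heavily written audit/ETL table: analyze after ~1% of rows
-- change instead of the 0.1 (10%) configured globally above, so planner
-- statistics stay closer to reality between batches.
ALTER TABLE etl_audit
  SET (autovacuum_analyze_scale_factor = 0.01,
       autovacuum_analyze_threshold = 1000);

-- A manual ANALYZE right after a large ETL run also keeps plans stable.
ANALYZE etl_audit;

On 8.3 the rough equivalent is an entry in the pg_autovacuum system catalog or a scheduled ANALYZE.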