Re: [PERFORM] pg_dump and pg_restore

2010-05-22 Thread Peter Koczan
On Mon, May 17, 2010 at 12:04 AM, Jayadevan M
jayadevan.maym...@ibsplc.com wrote:
 Hello all,
 I was testing how much time a pg_dump backup would take to get restored.
 Initially, I tried it with psql (on a backup taken with pg_dumpall). It took
 me about one hour. I felt that I should target for a recovery time of 15
 minutes to half an hour. So I went through the blogs/documentation etc and
 switched to pg_dump and pg_restore. I tested only the database with the
 maximum volume of data (about 1.5 GB). With
 pg_restore -U postgres -v -d PROFICIENT --clean -Fc proficient.dmp
 it took about 45 minutes. I tried it with
 pg_restore -U postgres -j8 -v -d PROFICIENT --clean -Fc proficient.dmp
 Not much improvement there either. Have I missed something, or will 1.5 GB of
 data on a machine with the following configuration take about 45 minutes? There
 is nothing else running on the machine consuming memory or CPU. Out of 300-odd
 tables, about 10 have millions of records; the rest have a few thousand records
 at most.

 Here are the specs (a PC-class machine):

 PostgreSQL 8.4.3 on i686-pc-linux-gnu
 CentOS release 5.2
 Intel(R) Pentium(R) D CPU 2.80GHz
 2 GB RAM
 Storage is local disk.

 Postgresql parameters (what I felt are relevant) -
 max_connections = 100
 shared_buffers = 64MB
 work_mem = 16MB
 maintenance_work_mem = 16MB
 synchronous_commit = on

Do the big tables have lots of indexes? If so, you should raise
maintenance_work_mem.
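
For example, one way to do that just for this database, without touching
postgresql.conf (256MB is only a guess, and note that with -j each
parallel worker connection gets its own maintenance_work_mem):

ALTER DATABASE "PROFICIENT" SET maintenance_work_mem = '256MB';
-- run the restore, then put the setting back:
ALTER DATABASE "PROFICIENT" RESET maintenance_work_mem;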

Peter



Re: [PERFORM] postgresql is slow with larger table even it is in RAM

2008-03-26 Thread Peter Koczan
On Tue, Mar 25, 2008 at 3:35 AM, sathiya psql [EMAIL PROTECTED] wrote:
 Dear Friends,
 I have a table with 32 lakh records in it. Table size is nearly 700 MB,
 and my machine has 1 GB + 256 MB RAM. I created a tablespace in
 RAM, and then created this table in that RAM.

 So now everything is in RAM. If I do a count(*) on this table it returns
 327600 in 3 seconds. Why is it taking 3 seconds? Because I am sure that
 no disk I/O is happening. (Using vmstat I confirmed that no disk I/O is
 happening, and swap is also not used.)

 Any idea on this?

 I searched a lot in newsgroups but can't find anything relevant (because
 everywhere they are talking about disk access speed, and here I don't want to
 worry about disk access).

  If required I will give more information on this.

Two things:

- Are you VACUUM'ing regularly? It could be that you have a lot of
dead rows and the table is spread out over a lot of pages of mostly
dead space. That would cause *very* slow seq scans.

- What is your shared_buffers set to? If it's really low then postgres
could be constantly swapping from ram-disk to memory. Not much would
be cached, and performance would suffer.
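
If you want to check both of those quickly, something like this will do
it (the table name is made up):

VACUUM VERBOSE bigtable;   -- reports dead row versions and total pages
SELECT relpages, reltuples FROM pg_class WHERE relname = 'bigtable';
SHOW shared_buffers;       -- the out-of-the-box default is very small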

FWIW, I did a select count(*) on a table with just over 30 rows,
and it only took 0.28 sec.

Peter



Re: [PERFORM] Anyone using a SAN?

2008-03-19 Thread Peter Koczan
  Dell acquired Equallogic last November/December.

  I noticed your Dell meeting was a Dell/EMC meeting. Have you talked to them
 or anyone else about Equallogic?

Now that you mention it, I do recall a bit about Equallogic in the Dell
pitch. It didn't really stand out in my mind, and a lot of the
technical details were similar enough to the EMC details that they
just melded in my mind.

  When I was looking at iSCSI solutions, the Equallogic was really slick. Of
 course, I needed high-end performance, which of course came at a steep
 price, and the project got canned. Oh well. Still, the EL solution claimed
 near linear scalability when additional capacity/shelves were added. And,
 they have a lot of really nice technologies for managing the system.

If you think Equallogic is slick, check out 3par. They've got a lot of
very cool features and concepts. Unfortunately, this comes at a higher
price. To each his own, I guess.

Our meetings didn't focus a lot on scalability of capacity, as we just
didn't think to ask. I think the basic pitch was "it scales well"
without any real hard data.

Peter



Re: [PERFORM] What is the best way to storage music files in Postgresql

2008-03-17 Thread Peter Koczan
   I am going to embark on building a music library using apache,
   postgresql and php.  What is the best way to store the music files?

  Your options are either to use a BLOB within the database or to store
  paths to normal files in the file system in the database. I suspect
  using normal files will make backup and management a great deal easier
  than using in-database BLOBs, so personally I'd do it that way.

I discussed something like this with some co-workers recently, and
here's what I had to say. Not all of these apply to the original
message, but they are things to consider when marrying a database to a
file storage system.

Storing the files in the database as BLOBs:
Pros:
- The files can always be seen by the database system as long as it's
up (there's no dependence on an external file system).
- There is one set of locking mechanisms, meaning that the file
operations can be atomic with the database operations.
- There is one set of permissions to deal with.
Cons:
- There is almost no way to access files outside of the database. If
the database goes down, you are screwed.
- If you don't make good use of tablespaces and put blobs on a
separate disk system, the disk could thrash going between data and
blobs, affecting performance.
- There are stricter limits for PostgreSQL blobs (1 GB size limits, I've read).

Storing files externally, storing pathnames in the database:
Pros:
- You can access and manage files from outside the database and
possibly using different interfaces.
- There's a lot less to store directly in the database.
- You can use existing file-system permissions, mechanisms, and limits.
Cons:
- You are dealing with two storage systems and two different locking
systems which are unlikely to play nice with each other. Transactions
are not guaranteed to be atomic (e.g. a database rollback will not
rollback a file system operation, a commit will not guarantee that
data in a file will stay).
- The file system has to be seen by the database system and any remote
clients that wish to use your application, meaning that a networked FS
is likely to be used (depending on how many clients you have and how
you like to separate services), with all the fun that comes from
administering one of those. Note that this one in particular really
only applies to enterprise-level installations, not smaller
installations like the original poster's.
- If you don't put files on a separate disk-system or networked FS,
you can get poor performance from the disk thrashing between the
database and the files.

There are a couple main points:
1. The favorite answer in computing, "it depends," applies here. What
you decide depends on your storage system, your service and
installation policies, and how important fully atomic transactions are
to you.
2. If you want optimal performance out of either of these basic
models, you should make proper use of separate disk systems. I have no
idea which one is faster (it depends, I'm sure) nor do I have much of
an idea of how to benchmark this properly.
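
For what it's worth, a bare-bones sketch of the two models (the names
are made up, and the blob option shown here uses a bytea column rather
than large objects):

-- Option 1: the audio lives in the database
CREATE TABLE songs_blob (
    id     serial PRIMARY KEY,
    title  text NOT NULL,
    audio  bytea NOT NULL
);

-- Option 2: the audio lives on the file system, only the path is stored
CREATE TABLE songs_path (
    id     serial PRIMARY KEY,
    title  text NOT NULL,
    path   text NOT NULL   -- e.g. '/srv/music/artist/track.ogg'
);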

Peter



Re: [PERFORM] What is the best way to storage music files in Postgresql

2008-03-17 Thread Peter Koczan
  It seems to me that as such a database gets larger, it will become much harder to
 manage with the 2 systems.  I am talking mostly about music, so each song
 should not get too large.

I was just talking about points to consider in general. Getting to
your specific situation...

As far as BLOBs vs. file pointers. Test it out, use what you're most
comfortable using.

I would not set up a networked file system for the sole purpose of
managing and storing files a database will point to. If you already
have access to a networked file system, consider that as an option,
but don't add more work for yourself if you don't have to. Many
applications I work on use the database to store pathnames while the
files themselves are stored in a networked file system. It's honestly
not a huge pain to manage this if it's already available, but as I
mentioned before, there are caveats.

Also, in my experience, the amount of management you do in a database
doesn't directly depend on the amount of data you put in. In other
words, your database shouldn't become much more difficult to manage
over time if all you are doing is adding more rows to tables.

 I have read a lot on this list and on other resources and there seems to be
 a leaning toward 1+0 RAIDs for storage.  It seems to be the most flexible when it
 comes to speed, redundancy and recovery time.  I do want my database to be
 fully atomic.  I think that is important as this database grows.  Are my
 assumptions wrong?


As far as RAID levels go, RAID 10 is usually optimal for databases, so
your assumptions are correct. The extra cost for disks, I believe, is
paid off by the advantages you mentioned, at least for typical
database-related workloads. RAID 0 doesn't allow for any disaster
recovery, RAID 1 is ok as long as you can handle having only 2 disks
available, and RAID 5 and RAID 6 are just huge pains and terribly slow
for writes.

Note that you should go for a battery-backup if you use hardware RAID.

Hope this helps.

Peter



Re: [PERFORM] Anyone using a SAN?

2008-03-14 Thread Peter Koczan
Hi all,

I had a few meetings with SAN vendors and I thought I'd give you some
follow-up on points of potential interest.

- Dell/EMC
The representative was like the Dell dude grown up. The sales pitch
mentioned price point about twenty times (to the point where it was
annoying), and the pitch ultimately boiled down to "Dude, you're
getting a SAN." My apologies in advance for bringing back repressed
memories of the Dell dude. As far as technical stuff goes, it's about
what you'd expect from a low-level SAN. The cost for a SAN was in the
$2-3 per GB range if you went with the cheap option...not terrible,
but not great either, especially since you'd have to buy lots of GB.
Performance numbers weren't bad, but they weren't great either.

- 3par
The sales pitch was more focused on technical aspects and only
mentioned price point twice...which is a win in my books, at least
compared to Dell. Their real place to shine was in the technical
aspect. Whereas Dell just wanted to sell you a storage system that you
put on a network, 3par wanted to sell you a storage system
specifically designed for a network, and change the very way you think
about storage. They had a bunch of cool management concepts, and very
advanced failover, power outage, and backup techniques and tools.
Performance wasn't shabby either; for instance, a RAID 5 set could get
about 90% of the IOPS and transfer rate that a RAID 10 set could. How
exactly this compares to DAS they didn't say. The main stumbling block
with 3par is price. While they didn't give any specific numbers, best
estimates put a SAN in the $5-7 per GB range. The extra features just
might be worth it though.

- Lefthand
This meeting is still upcoming, so I don't yet have as well-formed an
opinion. Looking at their website, they seem closer to the Dell end in
terms of price and functionality. I'll keep you posted as I have
more info. They seem good for entry-level SANs, though.

Luckily, almost everything here works with Linux (at least the major
distros), including the management tools, in case people were worried
about that. One of the key points to consider going forward is that
the competition of iSCSI and Fibre Channel techs will likely bring
price down in the future. While SANs are certainly more expensive than
their DAS counterparts, the gap appears to be closing.

However, to paraphrase a discussion between a few of my co-workers,
you can buy toilet paper or kitty litter in huge quantities because
you know you'll eventually use it...and it doesn't change in
performance or basic functionality. Storage is just something that you
don't always want to buy a lot of in one go. It will get bigger, and
cheaper, and probably faster in a relatively short amount of time. The
other thing is that you can't really get a small SAN. The minimum is
usually in the multiple TB range (and usually over 10 TB). I'd love to be
able to put together a proof of concept and a test using 3par's
technology and commodity 80GB slow disks, but I really can't. You're
stuck with going all-in right away, and enough people have had
problems being married to specific techs or vendors that it's really
hard to break that uneasiness.
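
(For a rough sense of scale: at the per-GB figures above, a 10 TB entry
point works out to something like $20-30k for the Dell/EMC option versus
$50-70k for 3par.)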

Thanks for reading, hopefully you found it slightly informative.

Peter



[PERFORM] disabling an index without deleting it?

2008-02-26 Thread Peter Koczan
This might be a weird question...is there any way to disable a
particular index without dropping it?

There are a few queries I run where I'd like to test out the effects
of having (and not having) different indexes on particular query plans
and performance. I'd really prefer not to have to drop and ultimately
recreate a particular index, as some of the data sets are quite large.

So, is there any way to do this, or at least mimic this sort of behavior?

Peter



Re: [PERFORM] Anyone using a SAN?

2008-02-18 Thread Peter Koczan
 That's true about SANs in general. You don't buy a SAN because it'll
 cost less than just buying the disks and a controller. You buy a SAN
 because it'll let you make managing it easier. The break-even point has
 more to do with how many servers you're able to put on the SAN and how
 often you need to do tricky backup and upgrade procedures than it
 does with the hardware.

One big reason we're really looking into a SAN option is that we have
a lot of unused disk space. A typical disk usage scheme for us is 6 GB
for a clean Linux install, and 20 GB for a Windows install. Our disks
are typically 80GB, and even after decent amounts of usage we're not
even approaching half that. We install a lot of software in AFS, our
networked file system, and users' home directories and project
directories are in AFS as well. Local disk space is relegated to the
OS and vendor software, servers that need it, and seldom-used scratch
space. There might very well be a break-even point for us in terms of
cost.

One of the other things I was interested in was the hidden costs of
a SAN. For instance, we'd probably have to invest in more UPS capacity
to protect our data. Are there any other similar points that people
don't initially consider regarding a SAN?

Again, thanks for all your help.

Peter



[PERFORM] Anyone using a SAN?

2008-02-13 Thread Peter Koczan
Hi all,

We're considering setting up a SAN where I work. Is there anyone using
a SAN, for postgres or other purposes? If so I have a few questions
for you.

- Are there any vendors to avoid or ones that are particularly good?

- What performance or reliability implications exist when using SANs?

- Are there any killer features with SANs compared to local storage?

Any other comments are certainly welcome.

Peter



Re: [PERFORM] Anyone using a SAN?

2008-02-13 Thread Peter Koczan
Thanks for all your input, it is very helpful. A SAN for our postgres
deployment is probably sufficient in terms of performance, because we
just don't have that much data. I'm a little concerned about needs for
user and research databases, but if a project needs a big, fast
database, it might be wise to have them shell out for DAS.

My co-workers and I are meeting with a vendor in two weeks (3Par,
specifically), and I think I have a better idea of what I should be
looking at. I'll keep you all up on the situation. Keep the ideas
coming as I still would like to know of any other important factors.

Thanks again.

Peter



Re: [PERFORM] Join Query Perfomance Issue

2008-02-12 Thread Peter Koczan
 I have serious performance problems with the following type of queries:

 Doesn't look too bad to me, but I'm not that deep into SQL query
 optimization. However, this type of query is used in a function to
 access a normalized, partitioned database, so better performance on these
 queries would speed up the whole database system big time.
 Any suggestions here would be great. I already tested some things,
 using inner joins and rearranging the order of the tables, but saw only
 minor changes in the runtime; the version above seemed to get us the
 best performance.

Can you send the table definitions of the tables involved in the
query, including index information? Might be if we look hard enough we
can find something.

Peter



Re: [PERFORM] TB-sized databases

2007-11-27 Thread Peter Koczan
Thanks all. This is just what I needed.

On Nov 26, 2007 1:16 PM, Stephen Cook [EMAIL PROTECTED] wrote:
 I think either would work; both PostgreSQL and MS SQL Server have
 success stories out there running VLDBs.  It really depends on what you
 know and what you have.  If you have a lot of experience with Postgres
 running on Linux, and not much with SQL Server on Windows, of course the
 former would be a better choice for you.  You stand a much better chance
 working with tools you know.



 Pablo Alcaraz wrote:
  I had a client that tried to use MS SQL Server to run a 500Gb+ database.
  The database simply collapsed. They switched to Teradata and it is
  running well. This database is now 1.5Tb+.
 
  Currently I have clients running huge PostgreSQL databases and they are
  happy. In one client's database the biggest table is 237Gb+ (only 1
  table!) and PostgreSQL runs the database without problems using
  partitioning, triggers and rules (using PostgreSQL 8.2.5).
 
  Pablo
 
  Peter Koczan wrote:
  Hi all,
 
  I have a user who is looking to store 500+ GB of data in a database
  (and when all the indexes and metadata are factored in, it's going to
  be more like 3-4 TB). He is wondering how well PostgreSQL scales with
  TB-sized databases and what can be done to help optimize them (mostly
  hardware and config parameters, maybe a little advocacy). I can't
  speak on that since I don't have any DBs approaching that size.
 
  The other part of this puzzle is that he's torn between MS SQL Server
  (running on Windows and unsupported by us) and PostgreSQL (running on
  Linux...which we would fully support). If any of you have ideas of how
  well PostgreSQL compares to SQL Server, especially in TB-sized
  databases, that would be much appreciated.
 
  We're running PG 8.2.5, by the way.
 
  Peter
 




[PERFORM] TB-sized databases

2007-11-26 Thread Peter Koczan
Hi all,

I have a user who is looking to store 500+ GB of data in a database
(and when all the indexes and metadata are factored in, it's going to
be more like 3-4 TB). He is wondering how well PostgreSQL scales with
TB-sized databases and what can be done to help optimize them (mostly
hardware and config parameters, maybe a little advocacy). I can't
speak on that since I don't have any DBs approaching that size.

The other part of this puzzle is that he's torn between MS SQL Server
(running on Windows and unsupported by us) and PostgreSQL (running on
Linux...which we would fully support). If any of you have ideas of how
well PostgreSQL compares to SQL Server, especially in TB-sized
databases, that would be much appreciated.

We're running PG 8.2.5, by the way.

Peter



Re: [PERFORM] Memory Settings....

2007-10-22 Thread Peter Koczan
I recently tweaked some configs for performance, so I'll let you in on
what I changed.

For memory usage, you'll want to look at shared_buffers, work_mem, and
maintenance_work_mem. Postgres defaults to very low values for these,
and to get good performance without a lot of disk paging, you'll want
to raise those values (you will need to restart the server, and
possibly tweak some kernel memory settings to allow a large
shared_buffers; I had to raise SHMMAX on Linux, but I don't know the
Windows analogue). The basic rule of thumb for shared_buffers is
25%-50% of main memory: enough to use main memory but leaving room for
work_mem to do its thing and for any other programs to run smoothly.
Tweak this as necessary.

The other big thing is the free space map, which tracks free space and
helps to prevent index bloat. A VACUUM VERBOSE in a database will tell
you what these values should be set to.

Go here for full details:
http://www.postgresql.org/docs/8.2/static/runtime-config.html, especially
http://www.postgresql.org/docs/8.2/static/runtime-config-resource.html
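
To make that concrete, here's the sort of thing I mean; the numbers are
purely illustrative starting points, not recommendations for your
particular box:

-- In postgresql.conf (shared_buffers needs a server restart):
--   shared_buffers       = 1GB     -- a chunk of the 8 GB of RAM
--   work_mem             = 16MB    -- per sort/hash, per backend
--   maintenance_work_mem = 256MB   -- for VACUUM and index builds
--   max_fsm_pages        = 1000000 -- size it from VACUUM VERBOSE output
-- Then confirm from psql:
SHOW shared_buffers;
SHOW work_mem;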

Peter

On 10/22/07, Lee Keel [EMAIL PROTECTED] wrote:



 I have a client server that is dedicated to being a Postgres 8.2.4 database
 server for many websites.  This server will contain approximately 15
 databases each containing between 40-100 tables.  Each database will have
 approximately 7 web applications pulling data from it, but there will
 probably be no more than 50 simultaneous requests.  The majority of the
 tables will be very small tables around 1K in total size.  However, most of
 the queries will be going to the other 10-15 tables that are in each
 database that will contain postgis shapes.  These tables will range in size
 from 50 to 730K rows and each row will range in size from 2K to 3MB.  The
 data will be truncated and reinserted as part of a nightly process but other
 than that, there won't be many writes during the day.  I am trying to tune
 this server to its maximum capacity.  I would appreciate any advice on any
 of the settings that I should look at.  I have not changed any of the
 settings before because I have never really needed to.  And even now, I have
 not experienced any bad performance, I am simply trying to turn the track
 before the train gets here.

 Server Specification:

 Windows 2003 Enterprise R2

 Dual-Quad Core 2.33GHz

 8GB RAM

 263 GB HD (I am not 100% on drive speed, but I think it is 15K)


 Thanks in advance,

 Lee Keel




Re: [PERFORM] sequence query performance issues

2007-10-01 Thread Peter Koczan
 *light bulb* Ahhh, that's it. So, I guess the solution is either
 to cast the column or wait for 8.3 (which isn't a problem since the
 port won't be done until 8.3 is released anyway).

Just a quick bit of follow-up:

This query works and is equivalent to what I was trying to do (minus
the randomization and limiting):
=> select a.uid from generate_series(1000, 32000) as a(uid) where
a.uid::smallint not in (select uid from people where uid is not null);

It turns out that this and using coalesce are a wash in terms of
performance, usually coming within 10 ms of each other no matter what
limit and ordering constraints you put on the queries.

Peter

=> explain analyze select a.uid from generate_series(1000, 32767) as
a(uid) where a.uid not in (select coalesce(uid, 0) from people);
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series a  (cost=718.41..733.41 rows=500 width=4) (actual time=68.742..186.340 rows=26808 loops=1)
   Filter: (NOT (hashed subplan))
   SubPlan
     ->  Seq Scan on people  (cost=0.00..702.68 rows=6294 width=2) (actual time=0.025..28.368 rows=6294 loops=1)
 Total runtime: 286.311 ms
(5 rows)

=> explain analyze select a.uid from generate_series(1000, 32767) as
a(uid) where a.uid::smallint not in (select uid from people where uid
is not null);
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series a  (cost=699.34..716.84 rows=500 width=4) (actual time=58.508..177.683 rows=26808 loops=1)
   Filter: (NOT (hashed subplan))
   SubPlan
     ->  Seq Scan on people  (cost=0.00..686.94 rows=4958 width=2) (actual time=0.017..23.123 rows=4971 loops=1)
           Filter: (uid IS NOT NULL)
 Total runtime: 277.699 ms
(6 rows)



Re: [PERFORM] sequence query performance issues

2007-09-28 Thread Peter Koczan
  Hmm - why is it doing that?

 I'm betting that the OP's people.uid column is not an integer.  Existing
 PG releases can't use hashed subplans for cross-data-type comparisons
 (8.3 will be a bit smarter).

*light bulb* Ahhh, that's it. So, I guess the solution is either
to cast the column or wait for 8.3 (which isn't a problem since the
port won't be done until 8.3 is released anyway).

Thanks again.

Peter



[PERFORM] sequence query performance issues

2007-09-27 Thread Peter Koczan
Hello,

I have a weird performance issue with a query I'm testing. Basically,
I'm trying to port a function that generates user uids, and since
postgres offers a sequence generator function, I figure I'd take
advantage of that. Basically, I generate our uid range, filter out
those which are in use, and randomly pick however many I need.
However, when I run it it takes forever (10 minutes and I get nothing
so I cancelled the query) and cpu usage on the server is maxed out.

Here's my query (I'll post the explain output later so as not to
obscure my question):
=> select a.uid from generate_series(1000, 32767) as a(uid) where
a.uid not in (select uid from people) order by random() limit 1;

I thought that nulls were a problem, so I tried:
=> select a.uid from generate_series(1000, 32767) as a(uid) where
a.uid not in (select coalesce(uid,0) from people) order by random()
limit 1;
And that finished in less than a second.

I then tried:
=> select a.uid from generate_series(1000, 32767) as a(uid) where
a.uid not in (select coalesce(uid,0) from people where uid is not
null) order by random() limit 1;
And we're back to taking forever.

So I have 2 questions:

- Is there a better query for this purpose? Mine works when coalesced,
but it seems a little brute-force and the random() sorting, while
kinda nice, is slow.

- Is this in any way expected? I know that nulls sometimes cause
problems, but why is it taking forever even when trying to filter
those out?

Thanks.

Peter

The gory details:
- There is a btree index on people(uid), and there are ~6300 rows, of
which ~1300 have null uids.

- EXPLAIN output (I couldn't get EXPLAIN ANALYZE output from the first
two queries since they took too long):
=> explain select a.uid from generate_series(1000, 32767) as a(uid)
where a.uid not in (select uid from people) order by random() limit 1;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Limit  (cost=40025.57..40025.60 rows=10 width=4)
   ->  Sort  (cost=40025.57..40026.82 rows=500 width=4)
         Sort Key: random()
         ->  Function Scan on generate_series a  (cost=693.16..40003.16 rows=500 width=4)
               Filter: (NOT (subplan))
               SubPlan
                 ->  Materialize  (cost=693.16..756.03 rows=6287 width=2)
                       ->  Seq Scan on people  (cost=0.00..686.87 rows=6287 width=2)
(8 rows)

=> explain select a.uid from generate_series(1000, 32767) as a(uid)
where a.uid not in (select uid from people where uid is not null)
order by random() limit 1;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Limit  (cost=31486.71..31486.73 rows=10 width=4)
   ->  Sort  (cost=31486.71..31487.96 rows=500 width=4)
         Sort Key: random()
         ->  Function Scan on generate_series a  (cost=691.79..31464.29 rows=500 width=4)
               Filter: (NOT (subplan))
               SubPlan
                 ->  Materialize  (cost=691.79..741.00 rows=4921 width=2)
                       ->  Seq Scan on people  (cost=0.00..686.87 rows=4921 width=2)
                             Filter: (uid IS NOT NULL)
(9 rows)

=> explain select a.uid from generate_series(1000, 32767) as a(uid)
where a.uid not in (select coalesce(uid, 0) from people) order by
random() limit 1;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Limit  (cost=756.97..756.99 rows=10 width=4)
   ->  Sort  (cost=756.97..758.22 rows=500 width=4)
         Sort Key: random()
         ->  Function Scan on generate_series a  (cost=718.30..734.55 rows=500 width=4)
               Filter: (NOT (hashed subplan))
               SubPlan
                 ->  Seq Scan on people  (cost=0.00..702.59 rows=6287 width=2)
(7 rows)

=> explain analyze select a.uid from generate_series(1000, 32767) as
a(uid) where a.uid not in (select coalesce(uid, 0) from people) order
by random() limit 1;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Limit  (cost=756.97..756.99 rows=10 width=4) (actual time=370.444..370.554 rows=10 loops=1)
   ->  Sort  (cost=756.97..758.22 rows=500 width=4) (actual time=370.434..370.472 rows=10 loops=1)
         Sort Key: random()
         ->  Function Scan on generate_series a  (cost=718.30..734.55 rows=500 width=4) (actual time=70.018..199.540 rows=26808 loops=1)
               Filter: (NOT (hashed subplan))
               SubPlan
                 ->  Seq Scan on people  (cost=0.00..702.59 rows=6287 width=2) (actual time=0.023..29.167 rows=6294 loops=1)
 Total runtime: 372.224 ms
(8 rows)


Re: [PERFORM] Tablespaces and NFS

2007-09-20 Thread Peter Koczan
 Anyway...  One detail I don't understand --- why do you claim that
 "You can't take advantage of the shared file system because you can't
 share tablespaces among clusters or servers"?

I say that because you can't set up two servers to point to the same
tablespace (i.e. you can't have server A and server B both point to
the tablespace in /mnt/nfs/postgres/), which basically defeats one of
the main purposes of using a shared file system: seeing, using, and
editing files from anywhere.

This is ill-advised and probably won't work for 2 reasons.

- Postgres tablespaces require empty directories for initialization
(see the sketch below). If you create a tablespace on server A, it puts
files in the previously empty directory. If you then try to create a
tablespace on server B pointing to the same location, it won't work
since the directory is no longer empty. You can get around this, in
theory, but you'd either have to directly mess with system tables or
fool Postgres into thinking that each server independently created
that tablespace (to which anyone will say, "NO").

- If you do manage to fool postgres into having two servers pointing
at the same tablespace, the servers really, REALLY won't play nice
with these shared resources, since they have no knowledge of each
other (I mean, even two clusters on the same server don't play nice
with memory). Basically, if they compete for the same file, either I/O
will be EXTREMELY slow because of file-locking mechanisms in the file
system, or you open things up to race conditions and data corruption.
In other words: BAD.
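
To illustrate the first point (the path is made up):

-- On server A this succeeds and fills the directory with postgres files:
CREATE TABLESPACE shared_space LOCATION '/mnt/nfs/postgres';
-- Running the same statement on server B is then rejected, because
-- postgres insists on initializing an empty directory.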

I know this doesn't fully apply to you, but I thought I should explain
my points better since you asked so nicely :-)

 This seems to be the killer point --- mainly because the network
 connection is 100Mbps (around 10 MB/sec --- less than 1/4 of
 the performance we'd expect from an internal hard drive).  If at
 least it was a Gigabit connection, I might still be tempted to
 retry the experiment.  I was thinking that *maybe* the latencies
 and contention due to head movements (on the order of milliseconds)
 would take precedence and thus, a network-distributed cluster of
 hard drives would end up winning.

If you get decently fast disks, or put some slower disks in RAID 10,
you'll easily get 100 MB/sec (and that's a conservative estimate).
Even with a Gbit network, you'll get, in theory 128 MB/sec, and that's
assuming that the NFS'd disks aren't a bottleneck.

 We're clear that that would be the *optimal* solution --- problem
 is, there's a lot of client-side software that we would have to
 change;  I'm first looking for a transparent solution in which
 I could distribute the load at a hardware level, seeing the DB
 server as a single entity --- the ideal solution, of course,
 being the use of tablespaces with 4 or 6 *internal* hard disks
 (but that's not an option with our current web hoster).

I sadly don't know enough networking to tell you how to tell the client
software "no really, I'm over here." However, one of the things I'm
fond of is using a module to store connection strings, and dynamically
loading said module on the client side. For instance, with Perl I
use...

use DBI;
use DBD::Pg;
use My::DBs;    # module that holds our connection strings

my $dbh = DBI->connect($My::DBs::mydb);

Assuming that the module and its entries are kept up to date, it will
just work. That way, there's only 1 module to change instead of n
client apps. I can have a new server with a new name up without
changing any client code.

 Anyway, I'll keep working on alternative solutions --- I think
 I have enough evidence to close this NFS door.

That's probably for the best.



Re: [PERFORM] Tablespaces and NFS

2007-09-19 Thread Peter Koczan
On 9/19/07, Carlos Moreno [EMAIL PROTECTED] wrote:
 Hi,

 Anyone has tried a setup combining tablespaces with NFS-mounted partitions?

 I'm considering the idea as a performance-booster --- our problem is
 that we are
 renting our dedicated server from a hoster that does not offer much
 flexibility
 in terms of custom hardware configuration;  so, the *ideal* alternative
 to load
 the machine with 4 or 6 hard drives and use tablespaces is off the table
 (no pun
 intended).

 We could, however, set up a few additional servers where we could configure
 NFS shares, mount them on the main PostgreSQL server, and configure
 tablespaces to load balance the access to disk.

 Would you estimate that this will indeed boost performance??  (our system
 does lots of writing to DB --- in all forms:  inserts, updates, and deletes)

 As a corollary question:  what about the WALs and tablespaces??  Are the
 WALs distributed when we setup a tablespace and create tables in it?
 (that is, are the WALs corresponding to the tables in a tablespace stored
 in the directory corresponding to the tablespace?  Or is it only the
 data, and
 the WAL keeps being the one and only?)

 Thanks,

 Carlos

About 5 months ago, I did an experiment serving tablespaces out of
AFS, another shared file system.

You can read my full post at
http://archives.postgresql.org/pgsql-admin/2007-04/msg00188.php

On the whole, you're not going to see a performance improvement
running tablespaces on NFS (unless the disk system on the NFS server
is a lot faster) since you have to go through the network as well as
NFS, both of which add overhead.

Usually, locking mechanisms on shared file systems don't play nice
with databases. You're better off using something else to load balance
or replicate data.

Peter

P.S. Why not just use the servers you're planning to set up as NFS
shares as your postgres server(s) instead?
