[PERFORM] The right SHMMAX and FILE_MAX

2011-04-30 Thread Phoenix Kiula
Hi. I'm on a 64 Bit CentOS 5 system, quadcore processor, 8GB RAM and
tons of data storage (1 TB SATAII disks).

The current SHMMAX and SHMALL are (commas added for legibility) --

kernel.shmmax = 68,719,476,736
kernel.shmall = 4,294,967,296

Now, according to my reading in the PG manual and this list, a good
recommended value for SHMMAX is

   (shared_buffers * 8192)

My postgresql.conf settings at the moment are:

max_connections = 300
shared_buffers = 300MB
effective_cache_size = 2000MB

By this calculation, shared_buffers * 8192 will be:

 2,457,600,000,000

That's a humongous number, so either the principle for SHMMAX is
amiss, or am I reading this wrong?
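
For what it's worth, here's the arithmetic as I did it in psql, plus my guess
at the other possible reading (where shared_buffers counts 8 kB buffers rather
than bytes):

  -- 300 "MB" taken literally as 300,000,000, times 8192 bytes:
  SELECT 300000000::bigint * 8192;           -- 2,457,600,000,000

  -- if shared_buffers instead means the number of 8 kB buffers,
  -- 300MB is 38,400 buffers and the product is back to ~300MB:
  SELECT (300 * 1024 * 1024 / 8192) * 8192;  -- 314,572,800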

Similarly with "fs.file-max". There are articles like this one:
http://tldp.org/LDP/solrhe/Securing-Optimizing-Linux-RH-Edition-v1.3/chap6sec72.html

Is this relevant for PostgreSQL performance at all, or should I skip that?

Thanks for any pointers!



Re: [PERFORM] stored proc and inserting hundreds of thousands of rows

2011-04-30 Thread Samuel Gendler
On Sat, Apr 30, 2011 at 5:12 PM, Jeff Janes  wrote:

>
>
> gist indices are designed to make this type of thing fast, by using
> techniques to rule out most of those comparisons without actually
> performing them.  I don't know enough about the
> guts of either your distance function or the gist indexes to know if
> you can do it this way, but if you can it would certainly be the way
> to go.
>

It is definitely a good idea to consider a gist index for eliminating most
of a large dataset, if applicable.  Do a little reading on the topic and,
hopefully, its applicability (or not) will become apparent.

However, as someone who has built a number of ad servers over the years, for
several of the larger ad networks, the first thing I'd do is separate your
ad serving from real-time interaction with your database, no matter what the
underlying technology.  If your ad serving is dependent on your database, it
means that hiccups in the database will have an impact on ad serving, which
is rarely tolerable.  And god forbid you should need to take the db down for
a period of maintenance.  The reliability and performance required of most ad
servers is often greater than what should reasonably be expected of a
relational database, particularly if there are other processes accessing the
database, as is the case with your system.  The first rule of ad serving is
that no outage of backend systems should ever be able to prevent or impact
front-end ad serving.

Some kind of in-memory cache of doc/ad mappings which the ad server interacts
with will serve you in good stead, and will be much easier to scale
horizontally than most relational db architectures lend themselves to.  If
you have an ever-increasing set of documents and ads, you'll inevitably wind
up 'sharding' your dataset across multiple db hosts in order to maintain
performance - which creates a number of maintenance complexities.  It's easier
to maintain a single database and do analytics over a single data source, but
insulate it from the real-time performance requirements of your ad serving.
Even something as simple as a process that pushes the most recent doc/ad
mappings into a memcache instance could be sufficient - and you can scale
your memcache across as many hosts as is necessary to deliver the lookup
latencies that you require, no matter how large the dataset.

Similarly, if you are updating the database from the ad server with each ad
served, in order to record an impression or click, you'll be far better off
logging those and then processing the logs in bulk on a periodic basis.  If
subsequent impressions are dependent upon what has already been served
historically, then use your memcache instance (or whatever structure you
eventually choose to utilize) to handle those lookups.  This gives you the
convenience and flexibility of a relational system with SQL for access, but
without the constraints of the capabilities of a single host limiting the
real-time performance of the system as a whole.
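
To make the log-and-bulk-load point concrete, something along these lines is
usually enough.  This is only a rough sketch; the table names, columns, and
file path are made up:

  -- staging table the periodic job loads the raw ad-server log into
  CREATE TABLE impression_staging (ad_id int, doc_id int, served_at timestamptz);
  -- summary table the rest of the system actually queries
  CREATE TABLE impression_counts (ad_id int, bucket date, impressions bigint);

  -- periodic batch job, e.g. every few minutes:
  COPY impression_staging FROM '/var/log/adserver/impressions.csv' WITH CSV;

  INSERT INTO impression_counts (ad_id, bucket, impressions)
  SELECT ad_id, served_at::date, count(*)
  FROM impression_staging
  GROUP BY ad_id, served_at::date;

  TRUNCATE impression_staging;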


Re: [PERFORM] stored proc and inserting hundreds of thousands of rows

2011-04-30 Thread Jeff Janes
On Sat, Apr 30, 2011 at 3:29 PM, Joel Reymont  wrote:
>
> On Apr 30, 2011, at 11:11 PM, Jeff Janes wrote:
>
>> But what exactly are you inserting?  The queries you reported below
>> are not the same as the ones you originally described.
>
> I posted the wrong query initially. The only difference is in the table that 
> holds the probability array.
>
> I'm inserting document id and ad id pairs to show that this ad is not linked 
> to this document. The mapping table has a primary key on the serial document 
> id.

Having the (doc_id, ad_id) pair be missing from the table is a far
more efficient way to show that the ad is not linked to the document
(i.e. that it is below the threshold).  Provided that you are careful
that there are no other reasons that the pair could be missing--but if
you are not careful about that, then I don't see how storing the full
matrix will save you anyway.
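
Either way, the check at serving time is then just an existence probe against
the pairs you do store.  A sketch against the doc_ads table from your original
post (the ids are made up):

  SELECT EXISTS (SELECT 1
                 FROM doc_ads
                 WHERE doc_id = 42 AND ad_id = 7) AS pair_present;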

>
>> In particular, they do not seem to use the "threshold" parameter that
>> the original ones did, whose job is presumably to cut the 2 million
>> down to a much smaller number that meet the threshold.  But how much
>> smaller is that number?
>
> The 5 minutes is with a threshold large enough to be irrelevant. I would like 
> to optimize the process before I apply the threshold to cut down the number 
> of rows.
>
>> It looks like "WHERE tab.distance <= 50.0;" is not accomplishing
>> anything.  Are you sure the parameter shouldn't be <=0.50 instead?
>
> No, ignore the threshold for now.

OK, but it seems to me that you are starting out by ruling out the one
optimization that is most likely to work.

>> Also, you previously said you didn't mind if this process took a
>> couple minutes, as long as it didn't interfere with other things going
>> on in the database.  So you probably need to describe what those other
>> things going on in the database are.
>
> Those other things are ad serving which boils down to a lookup of ad ids 
> linked to the document.
>
> This is a lookup from the mapping table using the primary key that goes on at 
> the same time as a large number of  mappings are being inserted into 
> the same table.

What numbers do you get for lookups per second when inserts are also
going on, versus when they are not going on?

The way I would approach this is by making two independent tasks: one
that inserts records at your anticipated rate, "insert into foo select
generate_series  from generate_series(1,10);" in a loop, and
another that generates select load against a separate table (like
pgbench -S), and see how the two interact with each other by competing
for CPU and IO.

You could throttle the insert process by adding pg_sleep() as a column in
one of your selects, so that it pauses at every row.  But due to the
granularity of pg_sleep, you might have to put it in a CASE expression so it
is invoked on only a random subset of the rows rather than on each row.
But once throttled, will it be able to keep up with the flow of new docs
and ads?
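
A rough sketch of that CASE throttle (the table, the 1%, and the 10 ms are
arbitrary):

  CREATE TABLE foo (n int, pause text);

  -- sleep ~10 ms on roughly 1% of the rows rather than a tiny sleep on every
  -- row; casting pg_sleep()'s void result to text just yields an empty string
  INSERT INTO foo
  SELECT g,
         CASE WHEN random() < 0.01 THEN pg_sleep(0.01)::text ELSE '' END
  FROM generate_series(1, 10000) AS g;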


>
> I don't think I can do the matching when serving an ad, though, as I will 
> still need to scan millions of probability vectors (one per doc) to calculate 
> the distance between current document and existing ads.

gist indices are designed to make this type of thing fast, by using
techniques to rule out most of those comparisons without actually
performing them.  I don't know enough about the
guts of either your distance function or the gist indexes to know if
you can do it this way, but if you can it would certainly be the way
to go.

Cheers,

Jeff



Re: [PERFORM] stored proc and inserting hundreds of thousands of rows

2011-04-30 Thread Joel Reymont

On Apr 30, 2011, at 11:11 PM, Jeff Janes wrote:

> But what exactly are you inserting?  The queries you reported below
> are not the same as the ones you originally described.

I posted the wrong query initially. The only difference is in the table that 
holds the probability array.

I'm inserting document id and ad id pairs to show that this ad is not linked to 
this document. The mapping table has a primary key on the serial document id.

> In particular, they do not seem to use the "threshold" parameter that
> the original ones did, whose job is presumably to cut the 2 million
> down to a much smaller number that meet the threshold.  But how much
> smaller is that number?

The 5 minutes is with a threshold large enough to be irrelevant. I would like 
to optimize the process before I apply the threshold to cut down the number of 
rows.

> It looks like "WHERE tab.distance <= 50.0;" is not accomplishing
> anything.  Are you sure the parameter shouldn't be <=0.50 instead?

No, ignore the threshold for now.

> Also, you previously said you didn't mind if this process took a
> couple minutes, as long as it didn't interfere with other things going
> on in the database.  So you probably need to describe what those other
> things going on in the database are.

Those other things are ad serving which boils down to a lookup of ad ids linked 
to the document. 

This is a lookup from the mapping table using the primary key that goes on at 
the same time as a large number of  mappings are being inserted into 
the same table.

Documents are uploaded into the system at a rate of 10k per day, once every 
couple of seconds. I wish I could get rid of storing the  mapping as 
that table is gonna grow absolutely huge when each new ad matches tens or 
hundreds of thousands of documents. 

I don't think I can do the matching when serving an ad, though, as I will still 
need to scan millions of probability vectors (one per doc) to calculate the 
distance between current document and existing ads.

Then again, the number of ads in the system will always be a fraction of the
number of documents, so perhaps the matching of a document to ads can be done
at runtime.
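
In other words, something along these lines, assuming the ads table carries
the same topics vector as docs (completely untested, and the ids and threshold
are made up):

  SELECT a.id, divergence(d.topics, a.topics) AS distance
  FROM ads a, docs d
  WHERE d.id = 42
    AND divergence(d.topics, a.topics) <= 0.5
  ORDER BY distance;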

> Also, you might have a data correctness problem.  If the plan is to
> scan new ads against all docs, and new docs against all ads;

That's basically it. 

As new ads are entered, they need to be matched with existing documents. 

As new documents are entered, they need to be matched with existing ads. 

Both ads and docs are represented by probability vectors of 150 floats so it's 
the same distance calculation.

> then if new rows are added to each table during overlapping transactions, the
> new ads against new docs comparison will not actually happen.  You
> will probably need to add manual locking to get around this problem.

I'll ponder this, thanks for pointing it out!


--
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
-++---
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
-++---






Re: [PERFORM] stored proc and inserting hundreds of thousands of rows

2011-04-30 Thread Jeff Janes
On Sat, Apr 30, 2011 at 2:15 PM, Joel Reymont  wrote:
>
> On Apr 30, 2011, at 7:24 PM, Kevin Grittner wrote:
>
>> If this is where most of the time is, the next thing is to run it
>> with EXPLAIN ANALYZE, and post the output.
>
> I was absolutely wrong about the calculation taking < 1s, it actually takes 
> about 30s for 2 million rows.
>
> Still, the difference between 5 minutes and 30s must be the insert.

But what exactly are you inserting?  The queries you reported below
are not the same as the ones you originally described.

In particular, they do not seem to use the "threshold" parameter that
the original ones did, whose job is presumably to cut the 2 million
down to a much smaller number that meet the threshold.  But how much
smaller is that number?  This will have a large effect on how long the
insert takes.


...

> Digging deeper into the distance function,
>
> EXPLAIN ANALYZE VERBOSE
> SELECT *
> FROM (SELECT id, divergence(, topics) AS distance FROM docs) AS 
> tab
> WHERE tab.distance <= 50.0;
>
> Subquery Scan on tab  (cost=0.00..38.00 rows=53 width=12) (actual 
> time=0.027..20429.299 rows=202 loops=1)
>  Output: tab.id, tab.distance
>  Filter: (tab.distance <= 50::double precision)
>  ->  Seq Scan on public.docs  (cost=0.00..358333.50 rows=160 width=36) 
> (actual time=0.025..19908.200 rows=202 loops=1)
>        Output: docs.id, divergence((::double 
> precision[])::topics, docs.topics)

It looks like "WHERE tab.distance <= 50.0;" is not accomplishing
anything.  Are you sure the parameter shouldn't be <=0.50 instead?

Also, you previously said you didn't mind if this process took a
couple minutes, as long as it didn't interfere with other things going
on in the database.  So you probably need to describe what those other
things going on in the database are.

Also, you might have a data correctness problem.  If the plan is to
scan new ads against all docs, and new docs against all ads; then if
new rows are added to each table during overlapping transactions, the
new ads against new docs comparison will not actually happen.  You
will probably need to add manual locking to get around this problem.

Cheers

Jeff



Re: [PERFORM] stored proc and inserting hundreds of thousands of rows

2011-04-30 Thread Joel Reymont

On Apr 30, 2011, at 7:36 PM, Kevin Grittner wrote:

> It may even be amenable to knnGiST indexing (a new feature coming in
> 9.1), which would let you do your select with an ORDER BY on the
> distance.

I don't think I can wait for 9.1, need to go live in a month, with PostgreSQL 
or without.

> PostgreSQL has a lot of very cool features you just don't have in any other 
> product!  :-)


There's a strong NoSQL lobby here and I'm trying my best to show that PG can 
handle the job!

--
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
-++---
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
-++---






Re: [PERFORM] stored proc and inserting hundreds of thousands of rows

2011-04-30 Thread Joel Reymont

On Apr 30, 2011, at 7:24 PM, Kevin Grittner wrote:

> If this is where most of the time is, the next thing is to run it
> with EXPLAIN ANALYZE, and post the output.

I was absolutely wrong about the calculation taking < 1s, it actually takes 
about 30s for 2 million rows.

Still, the difference between 5 minutes and 30s must be the insert.

SELECT (t).doc_id, (t).distance
 FROM (SELECT docs_within_distance('{ 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.586099770475, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.167233562858, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667, 
0.001667, 0.001667, 0.001667, 0.001667 }', 
50.0) as t) AS x;

This takes 27.44 seconds

EXPLAIN ANALYZE VERBOSE

Subquery Scan on x  (cost=0.00..0.27 rows=1 width=32) (actual 
time=22422.418..23835.468 rows=202 loops=1)
  Output: (t).doc_id, (t).distance
  ->  Result  (cost=0.00..0.26 rows=1 width=0) (actual 
time=22422.410..23184.086 rows=202 loops=1)
Output: docs_within_distance(('{}'::double 
precision[])::topics, 50::double precision)
Total runtime: 23948.563 ms

Topics is defined thusly: CREATE DOMAIN topics AS float[150];

Digging deeper into the distance function,

EXPLAIN ANALYZE VERBOSE
SELECT * 
FROM (SELECT id, divergence(, topics) AS distance FROM docs) AS tab
WHERE tab.distance <= 50.0;

Subquery Scan on tab  (cost=0.00..38.00 rows=53 width=12) (actual 
time=0.027..20429.299 rows=202 loops=1)
  Output: tab.id, tab.distance
  Filter: (tab.distance <= 50::double precision)
  ->  Seq Scan on public.docs  (cost=0.00..358333.50 rows=160 width=36) 
(actual time=0.025..19908.200 rows=202 loops=1)
Output: docs.id, divergence((::double 
precision[])::topics, docs.topics)
Total runtime: 20550.019 ms

I can't dig any deeper because divergence is a C function.

Thanks, Joel

--
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
-++---
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
-++---






Re: [PERFORM] stored proc and inserting hundreds of thousands of rows

2011-04-30 Thread Kevin Grittner
Joel Reymont  wrote:
 
> I'm calculating distance between probability vectors, e.g. topics
> that a document belongs to and the topics of an ad.
> 
> The distance function is already a C function. Topics are
> float8[150].
> 
> Distance is calculated against all documents in the database
 
There's probably a way to index that so that you don't need to do a
full calculation against all documents in the database each time. 
It may even be amenable to knnGiST indexing (a new feature coming in
9.1), which would let you do your select with an ORDER BY on the
distance.
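
Just to illustrate the shape of it with built-in types (the divergence()
function over float8[150] would need its own operator class to get the same
treatment):

  -- 9.1 knnGiST with plain 2-D points: nearest rows come back first,
  -- driven by the index rather than a full scan-and-sort
  CREATE TABLE places (id int, loc point);
  CREATE INDEX places_loc_gist ON places USING gist (loc);

  SELECT id
  FROM places
  ORDER BY loc <-> point '(0.5, 0.5)'
  LIMIT 100;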
 
PostgreSQL has a lot of very cool features you just don't have in
any other product!  :-)
 
-Kevin



Re: [PERFORM] stored proc and inserting hundreds of thousands of rows

2011-04-30 Thread Kevin Grittner
[rearranging to correct for top-posting]
 
Joel Reymont  wrote:
> Kevin Grittner  wrote:
>> Joel Reymont  wrote:
>>
>>> We have 2 million documents now and linking an ad to all of them
>>> takes 5 minutes on my top-of-the-line SSD MacBook Pro.
>>
>> How long does it take to run just the SELECT part of the INSERT
>> by itself?
 
> Are you suggesting eliminating the physical linking and
> calculating matching documents on the fly?
 
I'm not suggesting anything other than it being a good idea to
determine where the time is being spent before trying to make it
faster.  You showed this as the apparent source of the five minute
delay:
 
  INSERT INTO doc_ads (doc_id, ad_id, distance)
  SELECT doc, (t).ad_id, (t).distance
  FROM (SELECT ads_within_distance(topics, threshold) AS t
   FROM docs
   WHERE id = doc) AS x;
 
What we don't know is how much of that time is due to writing to the
doc_ads table, and how much is due to reading the other tables.  We
can find that out by running this:
 
  SELECT doc, (t).ad_id, (t).distance
  FROM (SELECT ads_within_distance(topics, threshold) AS t
   FROM docs
   WHERE id = doc) AS x;
 
If this is where most of the time is, the next thing is to run it
with EXPLAIN ANALYZE, and post the output.  It's a whole different
set of things to try to tune if that part is fast and the INSERT
itself is slow.
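
When running it outside the function you will of course have to substitute
literal values for doc and threshold, since those only exist as parameters --
something like:

  EXPLAIN ANALYZE
  SELECT 42, (t).ad_id, (t).distance
  FROM (SELECT ads_within_distance(topics, 0.5) AS t
        FROM docs
        WHERE id = 42) AS x;   -- 42 and 0.5 are placeholders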
 
Of course, be aware of caching effects when you time this.
 
-Kevin



Re: [PERFORM] stored proc and inserting hundreds of thousands of rows

2011-04-30 Thread Joel Reymont
I'm calculating distance between probability vectors, e.g. topics that
a document belongs to and the topics of an ad.

The distance function is already a C function. Topics are float8[150].

Distance is calculated against all documents in the database, so it's
a table scan.

Sent from my comfortable recliner

On 30/04/2011, at 19:04, Pierre C  wrote:

>
> If you want to search by geographical coordinates, you could use a gist
> index, which can optimize that sort of thing (like retrieving all rows
> which fit in a box).



Re: [PERFORM] stored proc and inserting hundreds of thousands of rows

2011-04-30 Thread Pierre C


If you want to search by geographical coordinates, you could use a gist
index, which can optimize that sort of thing (like retrieving all rows
which fit in a box).
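
For instance (the schema here is only an illustration):

  -- gist index on a box column; && finds rows whose box overlaps the query box
  CREATE TABLE regions (id int, area box);
  CREATE INDEX regions_area_gist ON regions USING gist (area);

  SELECT id
  FROM regions
  WHERE area && box '((0,0),(10,10))';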




Re: [PERFORM] stored proc and inserting hundreds of thousands of rows

2011-04-30 Thread Joel Reymont
Calculating distance involves giving an array of 150 float8 to a pgsql
function, then calling a C function 2 million times (at the moment),
giving it two arrays of 150 float8.

Just calculating distance for 2 million rows and extracting the
distance takes less than a second. I think that includes sorting by
distance and sending 100 rows to the client.

Are you suggesting eliminating the physical linking and calculating
matching documents on the fly?

Is there a way to speed up my C function by giving it all the float
arrays, calling it once and having it return a set of matches? Would
this be faster than calling it from a select, once for each array?

Sent from my comfortable recliner

On 30/04/2011, at 18:28, Kevin Grittner  wrote:

> Joel Reymont  wrote:
>
>> We have 2 million documents now and linking an ad to all of them
>> takes 5 minutes on my top-of-the-line SSD MacBook Pro.
>
> How long does it take to run just the SELECT part of the INSERT by
> itself?
>
> -Kevin



Re: [PERFORM] stored proc and inserting hundreds of thousands of rows

2011-04-30 Thread Kevin Grittner
Joel Reymont  wrote:
 
> We have 2 million documents now and linking an ad to all of them
> takes 5 minutes on my top-of-the-line SSD MacBook Pro.
 
How long does it take to run just the SELECT part of the INSERT by
itself?
 
-Kevin



[PERFORM] stored proc and inserting hundreds of thousands of rows

2011-04-30 Thread Joel Reymont
I have a stored proc that inserts hundreds of thousands, potentially
millions, of rows (below).

This stored proc is part of the sequence of creating an ad campaign and
links an ad to documents it should be displayed with.

A few of these stored procs can run concurrently as users create ad campaigns.

We have 2 million documents now and linking an ad to all of them takes 5 
minutes on my top-of-the-line SSD MacBook Pro.

Last but not least, the system has to quickly serve ads while documents are
being linked, which is a problem at the moment.

What can I do to make linking documents to ads faster, or at least have it
make less of an impact on the system?  I would like the system to remain just
as responsive at serving ads while the linking itself is allowed to take a few
minutes.

One thing I'm concerned with, for example, is the whole multi-million-row
insert running within the stored proc transaction. I think inserting rows one
by one or in small batches may be an improvement, but I'm not sure how best to
accomplish that; the closest I've come up with is the sketch below.
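
That is, compute the matches once into a scratch table, then move them into
doc_ads in chunks, with each chunk in its own transaction driven from the
client. Completely untested, and the doc id, threshold and modulus are
arbitrary:

  -- step 1: compute the matches once, outside doc_ads
  CREATE TEMP TABLE doc_ads_stage AS
  SELECT 42 AS doc_id, (t).ad_id, (t).distance
  FROM (SELECT ads_within_distance(topics, 0.1) AS t
        FROM docs WHERE id = 42) AS x;

  -- step 2: move the rows over in smaller transactions from the client,
  -- e.g. one INSERT per remainder value 0..9
  INSERT INTO doc_ads (doc_id, ad_id, distance)
  SELECT doc_id, ad_id, distance
  FROM doc_ads_stage
  WHERE ad_id % 10 = 0;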

Thanks, Joel

---

CREATE DOMAIN doc_id AS varchar(64);
CREATE DOMAIN id AS int;

CREATE TABLE doc_ads
(
  doc_id    id    NOT NULL REFERENCES docs,
  ad_id     id    NOT NULL REFERENCES ads,
  distance  float NOT NULL
);

CREATE INDEX doc_ads_idx ON doc_ads(doc_id);

CREATE OR REPLACE FUNCTION link_doc_to_ads(doc id, threshold float) 
RETURNS void AS $$
BEGIN
  INSERT INTO doc_ads (doc_id, ad_id, distance)
  SELECT doc, (t).ad_id, (t).distance
  FROM (SELECT ads_within_distance(topics, threshold) AS t
   FROM docs
   WHERE id = doc) AS x;
  ANALYZE doc_ads;
END;
$$ LANGUAGE plpgsql;

--
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
-++---
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
-++---






Re: [PERFORM] REINDEX takes half a day (and still not complete!)

2011-04-30 Thread Kenneth Marshall
On Sat, Apr 30, 2011 at 05:26:36PM +0800, Phoenix Kiula wrote:
> On Sat, Apr 30, 2011 at 4:07 PM, Greg Smith  wrote:
> > On 04/23/2011 03:44 PM, Robert Haas wrote:
> >>
> >> On Apr 17, 2011, at 11:30 AM, Phoenix Kiula
> >>  wrote:
> >>
> >>>
> >>> Postgres is 8.2.9.
> >>>
> >>>
> >>
> >> An upgrade would probably help you a lot, and as others have said it
> >> sounds like your hardware is failing, so you probably want to deal with 
> >> that
> >> first.
> >>
> >> I am a bit surprised, however, that no one seems to have mentioned using
> >> CLUSTER rather than VACUUM or REINDEX. Sometimes that's worth a try...
> >>
> >
> > Don't know if it was for this reason or not for not mentioning it by others,
> > but CLUSTER isn't so great in 8.2.  The whole "not MVCC-safe" bit does not
> > inspire confidence on a production server.
> 
> 
> 
> 
> To everyone. Thanks so much for everything, truly. We have managed to
> salvage the data by exporting it in bits and pieces.
> 
> 1. First the schema only
> 2. Then pg_dump of specific small tables
> 3. Then pg_dump of timed bits of the big mammoth table
> 
> Not to jinx it, but the newer hardware seems to be doing well. I am on
> 9.0.4 now and it's pretty fast.
> 
> Also, as has been mentioned in this thread and other discussions on
> the list, just doing a dump and then fresh reload has compacted the DB
> to nearly 1/3rd of its previously reported size!
> 
> I suppose that's what I am going to do on a periodic basis from now
> on. There is a lot of DELETE/UPDATE activity. But I wonder if the
> vacuum stuff really should do something that's similar in function?
> What do the high-end enterprise folks do -- surely they can't be
> dumping/restoring every quarter or so... or are they?
> 
> Anyway, many many thanks to the lovely folks on this list. Much appreciated!
> 

The autovacuum and space management in 9.0 are dramatically more effective
and efficient than in 8.2.  Unless you have an odd corner case, there
really should be no reason for a periodic dump/restore. This is not your
grandmother's Oldsmobile... :)

Regards,
Ken



Re: [PERFORM] REINDEX takes half a day (and still not complete!)

2011-04-30 Thread Phoenix Kiula
On Sat, Apr 30, 2011 at 4:07 PM, Greg Smith  wrote:
> On 04/23/2011 03:44 PM, Robert Haas wrote:
>>
>> On Apr 17, 2011, at 11:30 AM, Phoenix Kiula
>>  wrote:
>>
>>>
>>> Postgres is 8.2.9.
>>>
>>>
>>
>> An upgrade would probably help you a lot, and as others have said it
>> sounds like your hardware is failing, so you probably want to deal with that
>> first.
>>
>> I am a bit surprised, however, that no one seems to have mentioned using
>> CLUSTER rather than VACUUM or REINDEX. Sometimes that's worth a try...
>>
>
> Don't know if it was for this reason or not for not mentioning it by others,
> but CLUSTER isn't so great in 8.2.  The whole "not MVCC-safe" bit does not
> inspire confidence on a production server.




To everyone. Thanks so much for everything, truly. We have managed to
salvage the data by exporting it in bits and pieces.

1. First the schema only
2. Then pg_dump of specific small tables
3. Then pg_dump of timed bits of the big mammoth table

Not to jinx it, but the newer hardware seems to be doing well. I am on
9.0.4 now and it's pretty fast.

Also, as has been mentioned in this thread and other discussions on
the list, just doing a dump and then fresh reload has compacted the DB
to nearly 1/3rd of its previously reported size!

I suppose that's what I am going to do on a periodic basis from now
on. There is a lot of DELETE/UPDATE activity. But I wonder if the
vacuum stuff really should do something that's similar in function?
What do the high-end enterprise folks do -- surely they can't be
dumping/restoring every quarter or so... or are they?

Anyway, many many thanks to the lovely folks on this list. Much appreciated!



Re: [PERFORM] REINDEX takes half a day (and still not complete!)

2011-04-30 Thread Greg Smith

On 04/23/2011 03:44 PM, Robert Haas wrote:
> On Apr 17, 2011, at 11:30 AM, Phoenix Kiula  wrote:
>
>> Postgres is 8.2.9.
>
> An upgrade would probably help you a lot, and as others have said it sounds
> like your hardware is failing, so you probably want to deal with that first.
>
> I am a bit surprised, however, that no one seems to have mentioned using
> CLUSTER rather than VACUUM or REINDEX. Sometimes that's worth a try...

Don't know if it was for this reason or not for not mentioning it by
others, but CLUSTER isn't so great in 8.2.  The whole "not MVCC-safe"
bit does not inspire confidence on a production server.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


