Re: [PERFORM] mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
changes. It does change the location of the cache, moving it from userland down into the kernel. This is a change for database administrators, but a good one, IMHO. Previously, the operating system's memory would be split roughly 25% kernel, 75% user, because PostgreSQL needed the available RAM for its own cache. Now that can move closer to the opposite, 75% kernel, 25% user, because most of the memory is mmap(2)'ed pages instead of actual memory in userland.

*) Pages can be protected via PROT_(EXEC|READ|WRITE). For backends that aren't making changes to the DDL or system catalogs (permissions, etc.), pages loaded from the catalogs could be mapped with PROT_READ, which would prevent changes to the catalogs. All DDL and permission-altering commands (anything that touches the system catalogs) would then map the page with the PROT_WRITE bit set, make their changes, then PROT_READ the page again. This would provide a first line of defense against buggy programs or exploits.

*) Eliminates the double caching done currently (caching in PostgreSQL and the kernel) by pushing the cache into the kernel... but without PostgreSQL knowing it's working on a page that's in the kernel.

Please ask questions if you have them.

-sc

--
Sean Chittenden
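To make the protect/unprotect dance concrete, here is a minimal sketch. It is purely illustrative: the file name catalog.page stands in for a real catalog page (it must already exist and be at least 8K long), and PostgreSQL does nothing like this today.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const size_t len = 8192;                /* one hypothetical catalog page */
    int fd = open("catalog.page", O_RDWR);  /* stand-in file, not a real PG relation */
    if (fd == -1) { perror("open"); return 1; }

    /* Map read-only by default: a stray write from a buggy backend
       SIGSEGVs instead of silently corrupting the catalog page. */
    char *page = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (page == MAP_FAILED) { perror("mmap"); return 1; }

    /* A DDL/permissions command upgrades protection, edits, then downgrades. */
    if (mprotect(page, len, PROT_READ | PROT_WRITE) == -1) { perror("mprotect"); return 1; }
    page[0] = 1;                            /* stand-in for the actual catalog update */
    if (mprotect(page, len, PROT_READ) == -1) { perror("mprotect"); return 1; }

    munmap(page, len);
    close(fd);
    return 0;
}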
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...
> this. The SUS text is a bit weaselly ("the application must ensure
> correct synchronization"), but the HPUX mmap man page, among others,
> lays it on the line:
>
>     It is also unspecified whether write references to a memory region
>     mapped with MAP_SHARED are visible to processes reading the file and
>     whether writes to a file are visible to processes that have mapped the
>     modified portion of that file, except for the effect of msync().
>
> It might work on particular OSes but I think depending on such
> behavior would be folly...

Agreed. Only OSes with a coherent file system buffer cache should ever use mmap(2). In order for this to work on HPUX, msync(2) would need to be used.

-sc

--
Sean Chittenden
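For the archives, the msync(2) step would look something like the following sketch (assumes the target file passed on the command line is at least 8K long):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDWR);
    if (fd == -1) { perror("open"); return 1; }

    char *p = mmap(NULL, 8192, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memcpy(p, "hello", 5);  /* modify the shared mapping */

    /* On an OS without a coherent buffer cache, this is the only
       portable way to guarantee the change is visible to processes
       that read(2) the file instead of mapping it. */
    if (msync(p, 8192, MS_SYNC) == -1) { perror("msync"); return 1; }

    munmap(p, 8192);
    close(fd);
    return 0;
}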
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
> As someone else noted, this doesn't belong in the filesystem (rather
> the kernel's block I/O layer/buffer cache). But I agree, an API by
> which we can tell the kernel what kind of I/O behavior to expect
> would be good. [snip] The closest API to what you're describing that
> I'm aware of is posix_fadvise(). While that is technically-speaking
> a POSIX standard, it is not widely implemented (I know Linux 2.6
> implements it; based on some quick googling, it looks like AIX does
> too).

Don't forget the existence/usefulness of the widely implemented madvise(2)/posix_madvise(2) call, which can give the OS the following hints: MADV_NORMAL, MADV_SEQUENTIAL, MADV_RANDOM, MADV_WILLNEED, MADV_DONTNEED, and MADV_FREE. :)

-sc

--
Sean Chittenden
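A quick sketch of how a sequential scanner would use one of those hints (illustrative only; this is not something PostgreSQL does today, and the file must be non-empty):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd == -1 || fstat(fd, &st) != 0 || st.st_size == 0) { perror(argv[1]); return 1; }

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Tell the VM we're about to scan the file front to back, so it
       can read ahead aggressively and recycle pages behind us. */
    if (madvise(p, st.st_size, MADV_SEQUENTIAL) != 0) perror("madvise");

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)  /* the "scan" */
        sum += p[i];
    printf("%ld\n", sum);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}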
Re: [PERFORM] mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
> lected. What performance issues do you think shared memory needs to
> have fixed? We don't issue any shmem kernel calls after the initial
> shmget, so comparing the level of kernel tenseness about shmget to
> the level of tenseness about mmap is simply irrelevant. Perhaps the
> reason you don't see any traffic about this on the kernel lists is
> that shared memory already works fine and doesn't need any fixing.

I'm gonna get flamed for this, but I think it's improperly used as a second-level cache on top of the operating system's cache. mmap(2) would consolidate all caching into the kernel.

> > Please ask questions if you have them.
>
> Do you have any arguments that are actually convincing?

Three things come to mind:

1) A single cache for pages.

2) The ability to give access hints to the kernel regarding future IO.

3) On-the-fly memory use for a cache. There would be no need to preallocate slabs of shared memory on startup.

And a more minor point would be:

4) Not having shared pages get lost when the backend dies (mmap(2) uses refcounts and cleans itself up, no need for ipcs/ipcrm/ipcclean). This isn't too practical in production though, but it sucks doing PostgreSQL development on OS X because there is no ipcs/ipcrm command.

> What I just read was a proposal to essentially throw away not only
> the entire low-level data access model, but the entire low-level
> locking model, and start from scratch.

From the above list, steps 2, 3, 5, 6, and 7 would be different than our current approach, all of which could be safely handled with some #ifdef's on platforms that don't have mmap(2).

> There is no possible way we could support both this approach and the
> current one, which means that we'd be permanently dropping support
> for all platforms without high-quality mmap implementations;

Architecturally, I don't see anything different or any incompatibilities that aren't solved with an #ifdef USE_MMAP/#else/#endif.

> Furthermore, you didn't give any really convincing reasons to think
> that the enormous effort involved would be repaid.

Stevens has a great reimplementation of cat(1) that uses mmap(2) and benchmarks the two. I did my own version of that here:

http://people.freebsd.org/~seanc/mmap_test/

When read(2)'ing/write(2)'ing /etc/services 100,000 times without mmap(2), it takes 82 seconds. With mmap(2), it takes anywhere from 1.1 to 18 seconds. The worst-case scenario with mmap(2) yields a speedup by a factor of four. Best-case scenario... *shrug* something better than 4x. I doubt PostgreSQL would see 4x speedups in the IO department, but I do think it would be vastly greater than the 3% suggested.

> Those oprofile reports Josh just put up showed 3% of the CPU time
> going into userspace/kernelspace copying. Even assuming that that
> number consists entirely of reads and writes of shared buffers (and
> of course no other kernel call ever transfers any data across that
> boundary ;-)), there's no way we are going to buy into this sort of
> project in hopes of a 3% win.

Would it be helpful if I created a test program that demonstrated the execution path for writing mmap(2)'ed pages as outlined above?

-sc

--
Sean Chittenden
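The test program at the URL above is the authoritative version; a minimal reconstruction of the read(2)-vs-mmap(2) idea (my sketch, not the original source) looks like this:

/* Copy a file to stdout two ways: a read(2)/write(2) loop with two
   kernel<->user copies per block, vs. a single write(2) straight from
   an mmap(2)'ed region backed by the buffer cache. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void cat_read(int fd) {
    char buf[8192];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)  /* kernel -> user copy */
        write(STDOUT_FILENO, buf, n);            /* user -> kernel copy */
}

static void cat_mmap(int fd, off_t size) {
    if (size == 0) return;
    char *p = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return; }
    write(STDOUT_FILENO, p, size);  /* pages come straight from the buffer cache */
    munmap(p, size);
}

int main(int argc, char **argv) {
    if (argc != 3) { fprintf(stderr, "usage: %s read|mmap file\n", argv[0]); return 1; }
    int fd = open(argv[2], O_RDONLY);
    struct stat st;
    if (fd == -1 || fstat(fd, &st) != 0) { perror(argv[2]); return 1; }
    if (strcmp(argv[1], "mmap") == 0)
        cat_mmap(fd, st.st_size);
    else
        cat_read(fd);
    close(fd);
    return 0;
}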
Re: [PERFORM] Need advice on postgresql.conf settings
> > The real issue is this: we have THE SAME queries taking anywhere
> > from .001 to 90.0 seconds... the server is using 98% of the
> > available RAM at all times (because of the persistent connections
> > via php), and I don't know what to do.
>
> Another possible line of attack is to use persistent (pooled)
> connections to cut down the number of live backend processes you
> need. However, depending on what your application software is, that
> might take more time/effort (= money) than dropping in some more RAM.

This particular feature is pure evilness. Using all of my fingers and toes, I can't count the number of times I've had a client do this and get themselves into a world of hurt. Somewhere in the PHP documentation there should be a big warning, wrapped in the blink tag, that steers people away from setting this. The extra time necessary to set up a TCP connection is less than the performance drag induced on the backend when persistent connections are enabled. Reread that last sentence until it sinks in. On a local network, this is a premature optimization that's hurting you.

> > max_files_per_process = 3052        # min 25
>
> You really have your kernel set to support 3052 * 75 simultaneously
> open files? Back this off. I doubt values beyond a couple hundred buy
> anything except headaches.

This, on the other hand, has made a large difference for me. The time necessary to complete open(2) calls can be expensive, especially when the database is poorly designed and a backend touches many different parts of the database spread across multiple files. 3000 is high, but I've found 500 to be vastly too low in some cases... in others, it's just fine. My rule of thumb has become: if you're doing lots of aggregate functions (ex: SUM(), COUNT()) more than once in the lifetime of a backend, increasing this value helps... otherwise it buys you little (if so, 1500 is generally sufficient).

Faster IO, however, is going to save you here. If you can, increase your disk caching in the OS. On FreeBSD, increase your KVA_PAGES and NBUF. Since you've freed up more RAM by disabling persistent connections, this shouldn't be a problem.

-sc

--
Sean Chittenden
Re: [PERFORM] memcached and PostgreSQL
> So what does memcached offer pgsql users? It would still seem to
> offer the benefit of a multi-machined cache.

Ack, I totally missed this thread. Sorry for jumping in late.

Basically, memcached and pgmemcache offer a more technically correct way of implementing query caching. MySQL's query caching is a disaster, IMHO. memcached alleviates this load from the database and puts it elsewhere in a more optimized form.

The problem with memcached by itself is that you're relying on the application to invalidate the cache. How many different places have to be kept in sync? Using memcached in its current form means relying on the application being developed correctly, with centralized libraries and database access routines. Bah, that's a cluster f#$@ waiting to happen.

pgmemcache fixes that, though, so that you don't have to worry about invalidating the cache in every application/routine. Instead you just centralize that logic in the database and automatically invalidate via triggers. It's working out very well for me. I'd be interested in success stories, fwiw. In the next week or so I'll probably stick this on pgfoundry and build a proper make/release structure.

-sc

--
Sean Chittenden
Re: [PERFORM] memcached and PostgreSQL
> The big concern I have about memcache is that because it controls
> storage external to the database there is no way to guarantee the
> cache is consistent with the database.

I've found that letting applications add data to memcache and then letting the database replace or delete keys seems to be the best approach to minimize exactly this issue. Having two clients update the cache is risky. Using triggers, or using NOTIFY plus tailing logs, makes this much more bulletproof.

> This is similar to sending email in a trigger or on commit where you
> can't be certain you send email always and only on a commit.

While this is certainly a possibility, it's definitely closer to the exception than the normal instance.

> In the database, we mark everything we do with a transaction id and
> mark the transaction id as committed in one operation. I see no way
> to do that with memcache.

Correct. With an ON COMMIT trigger, it'll be easier to have a 100% accurate cache. That said, memcache does exist outside of the database, so it's theoretically impossible to guarantee that the two are 100% in sync. pgmemcache goes a long way towards facilitating that the cache is in sync with the database, but it certainly doesn't guarantee it's in sync. That being said, I haven't had any instances of it not being in sync since using pgmemcache (I'm quite proud of this, to be honest *grin*). For critical operations such as financial transactions, however, I advise going to the database unless you're willing to swallow the financial cost of cache discrepancies.

-sc

--
Sean Chittenden
Re: [PERFORM] memcached and PostgreSQL
> My point was that there are two failure cases --- one where the cache
> is slightly out of date compared to the db server --- these are cases
> where the cache update is slightly before/after the commit.

I was thinking about this and ways to minimize it even further. Have memcache clients add data, and have a policy that the database only deletes data. This sets the database up as the bottleneck again, but then you have a degree of transactionality that couldn't previously be achieved with the database issuing replace commands. For example:

1) client checks the cache for data and gets a cache lookup failure
2) client begins transaction
3) client SELECTs data from the database
4) client adds the key to the cache
5) client commits transaction

This assumes that the client won't roll back or have a transaction failure. Again, in 50M transactions, I doubt one of them would fail (sure, it's possible, but that's a symptom of bigger problems: memcached isn't an RDBMS). The update case being:

1) client begins transaction
2) client updates data
3) database deletes record from memcache
4) client commits transaction
5) client adds data to memcache

> The second is where the cache update happens and the commit later
> fails, or the commit happens and the cache update never happens.

Having pgmemcache delete, not replace, data addresses this second issue.

-sc

--
Sean Chittenden
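A sketch of that first read path in C with libpq. cache_get()/cache_add() are hypothetical stand-ins for whatever memcached client library is in use (stubbed out here so the sketch compiles), and the users table and key are invented for illustration:

#include <stdio.h>
#include <libpq-fe.h>

/* Hypothetical stand-ins for a real memcached client API. */
static char *cache_get(const char *key) { (void)key; return NULL; }
static void  cache_add(const char *key, const char *val) { (void)key; (void)val; }

int main(void) {
    PGconn *conn = PQconnectdb("dbname=test");
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    char *val = cache_get("user:100:name");            /* 1) check the cache */
    if (val == NULL) {                                 /*    ... lookup failure */
        PQclear(PQexec(conn, "BEGIN"));                /* 2) begin transaction */
        PGresult *res = PQexec(conn,                   /* 3) SELECT the data */
                "SELECT name FROM users WHERE id = 100");
        if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
            cache_add("user:100:name", PQgetvalue(res, 0, 0));  /* 4) add key */
        PQclear(res);
        PQclear(PQexec(conn, "COMMIT"));               /* 5) commit */
    }

    PQfinish(conn);
    return 0;
}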
Re: [PERFORM] Query planner plans very inefficient plans
>I have somewhere around 3M rows in the image table, and 37K rows in the
>ancestry table. The following is representative of some of the common
>queries I issue:
>
>select * from image natural join ancestry where ancestorid=100 and
>(state & 7::bigint) = 0::bigint;
>
>When I ask postgres to EXPLAIN it, I get the following:
>
>Merge Join  (cost=81858.22..81900.60 rows=124 width=49)
>  ->  Sort  (cost=81693.15..81693.15 rows=16288 width=41)
>        ->  Seq Scan on image  (cost=0.00..80279.17 rows=16288 width=41)
>  ->  Sort  (cost=165.06..165.06 rows=45 width=8)
>        ->  Index Scan using ancestry_ancestorid_key on ancestry
>              (cost=0.00..163.83 rows=45 width=8)
>
>It appears to me that the query executes as follows:
>
>1. Scan every row in the image table to find those where (state &
>   7::bigint) = 0::bigint
>2. Sort the results
>3. Use an index on ancestry to find rows where ancestorid=100
>4. Sort the results
>5. Join the two

FWIW, I use INTs as bit vectors for options in various applications and have run into this in a few cases. In the database, I only care about a few bits in the options INT, so what I did was create a function for each of the bits that I care about and then a function index. Between the two, I've managed to solve my performance problems.

CREATE FUNCTION app_option_foo_is_set(INT) RETURNS BOOL IMMUTABLE AS '
BEGIN
        -- plpgsql''s IF needs a boolean condition, so compare the
        -- masked value against zero instead of testing the INT itself.
        IF ($1 & 7::INT) <> 0 THEN
                RETURN TRUE;
        ELSE
                RETURN FALSE;
        END IF;
END;
' LANGUAGE 'plpgsql';

CREATE INDEX app_option_foo_fidx ON app_option_tbl (app_option_foo_is_set(options));

VACUUM ANALYZE;

Just make sure that you set your function to be IMMUTABLE. -sc

PS It'd be slick if PostgreSQL would collapse adjacent booleans into a bit in a byte: it'd save some apps a chunk of space. 32 options == 32 bytes with the type BOOL, but if adjacent BOOLs were collapsed, it'd only be 4 bytes on disk and maybe some page header data.

--
Sean Chittenden
[PERFORM] Moving postgresql.conf tunables into 2003...
What are the odds of going through and revamping some of the tunables in postgresql.conf for the 7.4 release? I was just working with someone on IRC, and on their 7800 RPM IDE drives, their random_page_cost was ideally suited to be 0.32: a far cry from 4. Doing so has been a win across the board, and the problem query went from about 40sec (seq scan) down to 0.25ms (using idx; higher than 0.32 resulted in a query time jump to 2sec, and at 0.4 it went back up to a full seq scan at 40sec).

I know Josh is working on revamping the postgresql.conf file, but would it be possible to include suggested values for various bits of hardware and then solicit contributions from admins on this list who have tuned their DB correctly?

## random_page_cost -- units are one sequential page fetch cost
#random_page_cost = 4       # default - very conservative
#random_page_cost = 0.9     # IDE 5200 RPM, 8MB disk cache
#random_page_cost = 0.3     # IDE 7800 RPM, 4MB disk cache
#random_page_cost = 0.1     # SCSI RAID 5, 10,000 RPM, 64MB cache
#random_page_cost = 0.05    # SCSI RAID 1+0, 15,000 RPM, 128MB cache
#...

## next_hardware_dependent_tunable
#hardware_dependent_tunable

I know these tables could get somewhat lengthy or organized differently, but given the file is read _once_ at _startup_, is seen by thousands of DBAs, is visited at least once for every installation (at the least to turn on TCP connections), and is often the only file other than pg_hba.conf that gets modified or looked at, this could be a very nice way of introducing DBAs to tuning PostgreSQL and reducing the number of people crying "PostgreSQL's slow." Having postgresql.conf be a clearing house for tunable values for various bits of hardware would be a huge win for the community and would hopefully radically change this database's perception. At the top of the file, it would be useful to include a blurb to the effect of:

# The default values for PostgreSQL are extremely conservative and are
# likely far from ideal for a site's needs.  Included in this
# configuration, however, are _suggested_ values to help aid in
# tuning.  The values below are not authoritative, merely contributed
# suggestions from PostgreSQL DBAs and committers who have
# successfully tuned their databases.  Please take these values as
# advisory only and remember that they will very likely have to be
# adjusted according to your site's specific needs.  If you have a
# piece of hardware that isn't mentioned below and have tuned your
# configuration aptly and have found a suggested value that the
# PostgreSQL community would benefit from, please send a description
# of the hardware, the name of the tunable, and the tuned value to
# [EMAIL PROTECTED] to be considered for inclusion in future
# releases.
#
# It should also go without saying that the PostgreSQL Global
# Development Group and its community of committers, contributors,
# administrators, and commercial supporters are absolved from any
# responsibility or liability with regards to the use of its software
# (see this software's license for details).  Any data loss,
# corruption, or performance degradation is the responsibility of the
# individual or group of individuals using/managing this installation.
#
# Hints to DBAs:
#
# *) Setup a regular backup schedule (hint: pg_dump(1)/pg_dumpall(1) +
#    cron(8))
#
# *) Tuning: Use psql(1) to test out values before changing values for
#    the entire database.  In psql(1), type:
#
#    1) SHOW [tunable_name];
#    2) SET [tunable_name] = [value];
#    3) [run query]
#    4) [repeat adjustments as necessary before setting a value here in
#       the postgresql.conf]
#    5) [Send a SIGHUP signal to the backend to have the config values
#       re-read]
#
# *) Never use kill -9 on the backend to shut it down.
#
# *) VACUUM ANALYZE your databases regularly.
#
# *) Use EXPLAIN ANALYZE [query] to tune queries.
#
# *) Read the online documentation at:
#    http://www.postgresql.org/docs/
#
# -- PostgreSQL Global Development Group

Just a thought. A bit lengthy, but given that out of the box most every value is set to be extremely conservative (detrimentally so, especially since the majority of users aren't running PostgreSQL in embedded devices and are on reasonably new hardware), and the config is only read in once and is generally the only file viewed by DBAs, it'd make PostgreSQL more competitive in the performance dept if there were some kind of suggested values for various tunables. Having someone whine, "my PostgreSQL database is slow" is really getting old when it really isn't, and a lack of tuning is at fault; lowering the bar to a successful and speedy PostgreSQL installation would be a win for everyone. The person I was helping also had the same data, schema, and query running on MySQL, and the fastest it could go was 2.7s (about 40M rows in the table).

-sc
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
> > What are the odds of going through and revamping some of the
> > tunables in postgresql.conf for the 7.4 release?  I was just
> > working with someone on IRC and on their 7800 RPM IDE drives,
> > their random_page_cost was ideally suited to be 0.32: a far cry
> > from 4.  Doing so has been a win across the board and the problem
> > query went from about 40sec (seq scan) down to 0.25ms (using idx,
> > higher than 0.32 resulted in a query time jump to 2sec, and at 0.4
> > it went back up to a full seq scan at 40sec).
>
> I'm the guy who advocates settings of 1 to 2, and that still sounds
> low to me. :-)  I'm wondering if the effective_cache_size was set
> properly, as well as there being enough buffers allocated.
>
> I generally set effective cache size to 100,000 pages (800 megs or
> so) on my box, which is where it sits most days.  With this setting
> I've found that settings of under 1 are not usually necessary to
> force the planner to take the path of righteousness (i.e. the
> fastest one :-)  1.2 to 1.4 are optimal to me.

This is a nightly report that's run; cache sizes won't impact the performance of the query at all. The planner was consistently choosing a sequential scan over using the index until the random_page_cost was set to 0.32. After adjustment, the query just flies (0.25ms at 0.32, jumping to about 2s just above that, and back to a 40s seq scan above 0.4). Since it's a nightly report that only gets performed once a day, and data is COPY'ed in once every few minutes, there's a huge amount of data that's not cached, nor should it be.

> Since theoretically a random page cost of 1 means no penalty to move
> the heads around, and there's ALWAYS a penalty for moving the heads
> around, we have to assume:
>
> 1: That either the planner is making poor decisions on some other
> variable, and we can whack the planner in the head with a really low
> random page cost.

By all accounts of having played with this query+data, this is the correct assumption from what I can tell.

> OR
>
> 2: The other settings are suboptimal (buffers, sort_mem,
> effective_cache_size, etc...) and lowering random page costs helps
> there.

None of those other than possibly sort_mem had any impact on the query, and even then, lower sort_mem doesn't help until the data's been picked out of the table. Sorting ~16K rows is quicker with more sort_mem; higher sort_mem has zero impact on fetching ~16K rows out of a table with 40M rows of data. Getting the planner to pick using the index to filter out data inserted in the last 3 days over doing a seq scan... well, I don't know how you could do that without changing the random_page_cost. A good thump to the side of the head would be welcome too if I'm wrong, just make sure it's a good thump with the appropriate clue-bat.

> I've always wondered if most performance issues aren't a bit of both.

Eh, in my experience, it's generally that random_page_cost needs to be adjusted to match the hardware, and this value, every year with new hardware, seems to be getting lower.

> The answer, of course, is fixing the planner so that a
> random_page_cost of anything less than 1 would never be needed,
> since by design, anything under 1 represents a computer that likely
> doesn't exist (in theory of course.)  A 1 would be a machine that
> was using solid state hard drives and had the same cost in terms of
> OS paths to do random accesses as sequential.

Well, this could be a bug then, but I'm skeptical. What's odd to me is that changing the value between 0.32, 0.33, and 0.4 radically changes the performance of the query.
> What constants in the planner, and / or formulas would be the likely
> culprits I wonder?  I've wandered through that page and wasn't sure
> what to play with.

random_page_cost should be proportional to the seek time necessary for the disk to find a page of data on its platters. It makes sense that this value, as time progresses, gets smaller as hardware gets faster.

-sc

--
Sean Chittenden
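To put a number on "proportional to the seek time", here's a naive, uncached model of random_page_cost. The disk figures below are my own assumptions for illustration, and the model deliberately ignores all caching, which is exactly what pushes real-world tuned values far below it (and, in this thread, below 1):

/* Naive physical model: ratio of a random 8K page fetch to a
   sequential one, ignoring every cache between the query and the
   platters. */
#include <stdio.h>

int main(void) {
    double seek_ms  = 8.5;                  /* avg seek, assumed 7200 RPM IDE drive */
    double rot_ms   = 60000.0 / 7200 / 2;   /* avg rotational latency: half a revolution */
    double mb_per_s = 40.0;                 /* assumed sustained sequential transfer */
    double page_ms  = 8.0 / 1024 / mb_per_s * 1000;  /* read one 8K page sequentially */

    double random_ms = seek_ms + rot_ms + page_ms;
    printf("random_page_cost ~ %.1f (uncached)\n", random_ms / page_ms);
    return 0;
}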
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
> I'm curious how many of the configuration values can be determined
> automatically, or with the help of some script.  It seems like there
> could be some perl script in contrib that could help figure this out.
> Possibly you are asked a bunch of questions and then the values are
> computed based on that.  Something like:
>
> How many tables will the system have?
> How much memory will be available to the postmaster?
> How many backends will there typically be?
> What is the avg seek time of the drive?
> What's the transfer rate of the drive?
>
> Seems to me that a lot of reasonable default values can be figured
> out from these basic questions.  FSM settings, Sort Mem, Random Page
> Cost, Effective Cache Size, Shared Memory, etc, etc.

Someone was working on a thing called pg_autotune or some such program that'd do exactly what you're thinking of:

http://archives.postgresql.org/pgsql-performance/2002-10/msg00101.php
http://gborg.postgresql.org/project/pgautotune/projdisplay.php

--
Sean Chittenden
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
> > What are the odds of going through and revamping some of the
> > tunables in postgresql.conf for the 7.4 release?  I was just
> > working with someone on IRC and on their 7800 RPM IDE drives,
> > their random_page_cost was ideally suited to be 0.32: a far cry
> > from 4.
>
> I find it very very hard to believe a random read was cheaper than a
> sequential read.  Something is shifty in your testing.

This is the procedure used to zero in on the number:

SET random_page_cost = 3;
[run query three times]
SET random_page_cost = 2;
[run query three times]
SET random_page_cost = 1;
[run query three times]
SET random_page_cost = 0.01;  -- verify that this tunable would make
                              -- a difference eventually
[run query three times]
SET random_page_cost = 0.5;
[run query three times]
SET random_page_cost = 0.2;   -- this was the 1st query that didn't
                              -- do a seq scan
[run query three times]
SET random_page_cost = 0.4;   -- back to a seq scan
[run query three times]
SET random_page_cost = 0.3;   -- idx scan, how high can I push the rpc?
[run query three times]
SET random_page_cost = 0.35;  -- interesting, the query time jumped to
                              -- about 0.2s... better than 40s, but not as
                              -- nice as the 0.25ms when the rpc was at 0.3
[run query three times]
SET random_page_cost = 0.32;  -- Sweet, 0.25ms for the query
[run query three times]
SET random_page_cost = 0.33;  -- Bah, back up to 0.2s
[run query three times]
SET random_page_cost = 0.31;  -- Down to 0.25ms, too low
[run query three times]
SET random_page_cost = 0.33;  -- Double check that it wasn't an errant
                              -- performance at 0.33
[run query three times]
SET random_page_cost = 0.32;  -- Double check that 0.32 is the magic number
[run query three times]
[edit postgresql.conf && killall -SIGHUP postmaster]

-sc

--
Sean Chittenden
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
Limit  (cost=227274.85..227274.90 rows=20 width=12) (actual time=28.42..28.75 rows=20 loops=1)
  ->  Sort  (cost=227274.85..227316.97 rows=16848 width=12) (actual time=28.41..28.52 rows=21 loops=1)
        Sort Key: count(srca)
        ->  Aggregate  (cost=224828.52..226092.11 rows=16848 width=12) (actual time=20.26..28.13 rows=23 loops=1)
              ->  Group  (cost=224828.52..225670.91 rows=168478 width=12) (actual time=19.99..25.86 rows=320 loops=1)
                    ->  Sort  (cost=224828.52..225249.72 rows=168478 width=12) (actual time=19.98..21.76 rows=320 loops=1)
                          Sort Key: srca
                          ->  Index Scan using mss_fwevent_evtime_sensorid_idx on mss_fwevent  (cost=0.00..210202.76 rows=168478 width=12) (actual time=0.35..17.61 rows=320 loops=1)
                                Index Cond: ((evtime > (now() - '06:00'::interval)) AND (sensorid = 7))
                                Filter: (NOT "action")
 Total runtime: 29.09 msec
(11 rows)

And there 'ya have it. The times are different from when I had him send me the queries this morning, but they're within an order of magnitude of each other and show the point. Oh, today they did a bunch of pruning of old data (nuked June's data)... the runtime differences are basically the same though.

> > I know Josh is working on revamping the postgresql.conf file, but
> > would it be possible to include suggested values for various bits of
> > hardware and then solicit contributions from admins on this list who
> > have tuned their DB correctly?
>
> I think such material belongs in the SGML docs, not hidden away in a
> config file that people may not look at...

The config file isn't hidden, though, and is very visible in the tuning process and to DBAs. I don't know if PostgreSQL distributions ship with TCP connections enabled by default (FreeBSD doesn't), so the config is always seen and viewed by DBAs. If it's not the TCP connections setting, it's the max connections setting or sort_mem, etc... Having the values dup'ed in the SGML would be good too, but it's of most practical relevance in the actual config: as an admin setting up a DB, I'd rather not have to fish around on postgresql.org to find a recommended setting; having it inline and just having to uncomment it is by far and away the most DBA-friendly and most likely to be used in the wild by admins.

-sc

--
Sean Chittenden
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
> Sean Chittenden <[EMAIL PROTECTED]> writes:
> > Getting the planner to pick using the index to filter out data
> > inserted in the last 3 days over doing a seq scan... well, I don't
> > know how you could do that without changing the random_page_cost.
>
> This sounds a *whole* lot like a correlation issue.  If the data in
> question were scattered randomly in the table, it's likely that an
> indexscan would be a loser.  The recently-inserted data is probably
> clustered near the end of the table (especially if they're doing
> VACUUM FULL after data purges; are they?).  But the planner's
> correlation stats are much too crude to recognize that situation, if
> the rest of the table is not well-ordered.

Data isn't scattered randomly from what I can tell and is basically already clustered, simply because the data is inserted linearly and based on time. I don't think they're doing a VACUUM FULL after a purge, but I'll double check on that on Monday when they get in. Is there an easy way of determining or setting a planner stat to suggest that data is ordered around a column in a permanent way? CLUSTER has always been a one-shot deal, and its effects wear off quickly depending on the way that data is inserted. It seems as though this would be a circumstance in which preallocated disk space would be a win (that way data wouldn't always be appended to the heap and could be inserted in order, of most use for non-time-related data: ex, some non-unique ID).

> If their typical process involves a periodic data purge and then a
> VACUUM FULL, it might be worth experimenting with doing a CLUSTER on
> the timestamp index instead of the VACUUM FULL.  The CLUSTER would
> reclaim space as effectively as VACUUM FULL + REINDEX, and it would
> leave the table with an unmistakable 1.0 correlation ... which
> should tilt the planner towards an indexscan without needing a
> physically impossible random_page_cost to do it.  I think CLUSTER
> would probably be a little slower than VACUUM FULL but it's hard to
> be sure without trying.

Hrm, I understand what clustering does; I'm just not convinced that it'll "fix" this performance problem unless CLUSTER sets some kind of hint that ANALYZE uses to modify the way in which it collects statistics. Like I said, I'll let you know on Monday when they're back in the shop, but I'm not holding my breath. I know random_page_cost is set to something physically impossible, but in terms of performance, it's always been the biggest win for me to set this puppy quite low. Bug in the planner, or in the documentation surrounding what this knob does, I'm not sure, but setting this to a low value consistently yields good results for me. The faster the drive, the lower the random_page_cost value. *shrug*

> That's one heck of a poor estimate for the number of rows returned.
>
> >  ->  Seq Scan on mss_fwevent  (cost=0.00..223312.60 rows=168478 width=12) (actual time=24253.66..24319.87 rows=320 loops=1)

The stats for the columns are already set to 1000 to aid with this... I don't know what else I can do here. Having the planner off by even half the actual size isn't uncommon in my experience.

-sc

--
Sean Chittenden
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
> > # The default values for PostgreSQL are extremely conservative and
> > # are likely far from ideal for a site's needs.  Included in this
> > # configuration, however, are _suggested_ values to help aid in
> > # tuning.
>
> This sort of narrative belongs in the SGML docs, not in a CONF file.
> In fact, one could argue that we should take *all* commentary out of
> the CONF file in order to force people to read the docs.

The SGML docs aren't in the DBA's face and are way out of the way for DBAs rolling out a new system or tuning an existing one. SGML == developer, conf == DBA.

> Database performance tuning will always be a "black art," as it
> necessitates a broad knowledge of PostgreSQL, OS architecture, and
> computer hardware.  So I doubt that we can post docs that would
> allow any 10% time DBA to make PostgreSQL "fly", but hopefully over
> the next year we can make enough knowledge public to allow anyone to
> make PostgreSQL "sprint".

I'm highly resistant to/disappointed in this attitude and firmly believe that there are well-understood algorithms that DBAs use to diagnose and solve performance problems. It's only a black art because it hasn't been documented. Performance tuning isn't voodoo; it's adjusting constraints to align with the execution of applications, and we know what the applications do, therefore the database can mold to the applications' needs. Some of those parameters are based on hardware constraints and should be pooled and organized as such:

random_page_cost ==
    avg cost of a random disk seek/read (eg: disk seek time) ==
    a constant for a given piece of hardware

There are other settings that are RAM-based as well, which should be formulaic and derived, though a formula hasn't been defined to date.

-sc

--
Sean Chittenden
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
> > The SGML docs aren't in the DBA's face and are way out of the way
> > for DBAs rolling out a new system or who are tuning the system.
> > SGML == Developer, conf == DBA.
>
> That's exactly my point.  We cannot provide enough documentation in
> the CONF file without septupling its length.  IF we remove all
> commentary, and instead provide a pointer to the documentation, more
> DBAs will read it.

Which I don't think would happen, and why I think the terse bits that are included are worthwhile. :)

> > Some of those parameters are based on hardware constraints and
> > should be pooled and organized as such.
> >
> > random_page_cost ==
> >     avg cost of a random disk seek/read (eg: disk seek time) ==
> >     constant for a given piece of hardware
>
> But, you see, this is exactly what I'm talking about.
> random_page_cost isn't static to a specific piece of hardware ... it
> depends as well on what else is on:
>
> *) the disk/array

Translation: how fast data is accessed and over how many drives.

> *) concurrent disk activity

A disk/database activity metric is different than the cost of a seek on the platters. :) That PostgreSQL doesn't currently support such a disk concurrency metric doesn't mean that its definition should get rolled into a different number in an attempt to accommodate for the lack thereof.

> *) disk controller settings

This class of settings falls into the same set of settings that affect random seeks on the platters/disk array(s).

> *) filesystem

Again, this influences avg seek time.

> *) OS

Again, avg seek time.

> *) distribution of records and tables

This has nothing to do with PostgreSQL's random_page_cost setting other than that if data is fragmented on the platter, the disk is going to have to do a lot of seeking. This is a stat that should get set by ANALYZE, not by a human.

> *) arrangement of the partitions on disk

Again, avg seek time.

> One can certainly get a "good enough" value by benchmarking the
> disk's random seek and calculating based on that ... but to get an
> "ideal" value requires a long interactive session by someone with
> experience and in-depth knowledge of the machine and database.

An "ideal" value isn't obtained via guess and check. Checking is only the verification of some calculable set of settings, though right now those calculated settings are guessed, unfortunately.

> > There are other settings that are RAM based as well, which should
> > be formulaic and derived though a formula hasn't been defined to
> > date.
>
> You seem pretty passionate about this ... how about you help me and
> Kevin define a benchmarking suite when I get back into the country
> (July 17)?  If we're going to define formulas, it requires that we
> have a near-comprehensive and consistent test database and test
> battery that we can run on a variety of machines and platforms.

Works for me, though a benchmark will be less valuable than adding a disk concurrency stat, improving data trend/distribution analysis, and using numbers that are concrete and obtainable through the OS kernel API or an admin manually plunking numbers in. I'm still recovering from my move from Cali to WA, so with any luck, I'll be settled in by then.

-sc

--
Sean Chittenden
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
> I don't have much to add because I'm pretty new to Postgres and have
> been soliciting advice here recently, but I totally agree with
> everything you said.  I don't mind if it's in the postgres.conf file
> or in a faq that is easy to find, I just would like it to be in one
> place.  A good example of the need for this is when I was tuning
> "effective_cache" I thought that was creating a cache for Postgres
> when in fact, as it was pointed out to me, it's just hinting to
> postgres the size of the OS cache.  Lots of ways for people to get
> really confused here.

I looked through src/doc/runtime.sgml for a good place to stick this and couldn't find a place that seemed appropriate, but on FreeBSD, this can be determined with a great deal of precision in a programmatic manner:

echo "effective_cache_size = $((`sysctl -n vfs.hibufspace` / 8192))"

The same OID is available via C too. It'd be slick if PostgreSQL could tune itself (on FreeBSD) at initdb time with the above code. If Linux exports this info via /proc and can whip out the appropriate magic, even better. An uncommented-out good guess that shows up in postgresql.conf would be stellar and is quite possible with the use of sed. Maybe an initdb switch could be added to have initdb tune the config it generates? And if a -n is added, have it generate a config and toss it to stdout?

case `uname` in
"FreeBSD")
        echo "effective_cache_size = $((`sysctl -n vfs.hibufspace` / 8192))"
        ;;
*)
        echo "Unable to automatically determine the effective cache size" >> /dev/stderr
        ;;
esac

-sc

--
Sean Chittenden
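And the C equivalent of the one-liner (FreeBSD-specific; assumes vfs.hibufspace is exported as an int, as it is on the versions discussed here):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int main(void) {
    int hibufspace;
    size_t len = sizeof(hibufspace);

    /* Same OID the shell one-liner reads via sysctl(8). */
    if (sysctlbyname("vfs.hibufspace", &hibufspace, &len, NULL, 0) != 0) {
        perror("sysctlbyname");
        return 1;
    }
    /* Convert bytes of buffer space into 8K PostgreSQL pages. */
    printf("effective_cache_size = %d\n", hibufspace / 8192);
    return 0;
}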
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
mss_masterlog=> EXPLAIN ANALYZE SELECT srca, COUNT(srca) FROM mss_fwevent WHERE
mss_masterlog->     sensorid = 7 AND evtime > (now() - '6 hours'::INTERVAL)
mss_masterlog->     AND NOT action GROUP BY srca ORDER BY COUNT DESC LIMIT 20;

                                    QUERY PLAN
--------------------------------------------------------------------------------
Limit  (cost=227274.85..227274.90 rows=20 width=12) (actual time=28.42..28.75 rows=20 loops=1)
  ->  Sort  (cost=227274.85..227316.97 rows=16848 width=12) (actual time=28.41..28.52 rows=21 loops=1)
        Sort Key: count(srca)
        ->  Aggregate  (cost=224828.52..226092.11 rows=16848 width=12) (actual time=20.26..28.13 rows=23 loops=1)
              ->  Group  (cost=224828.52..225670.91 rows=168478 width=12) (actual time=19.99..25.86 rows=320 loops=1)
                    ->  Sort  (cost=224828.52..225249.72 rows=168478 width=12) (actual time=19.98..21.76 rows=320 loops=1)
                          Sort Key: srca
                          ->  Index Scan using mss_fwevent_evtime_sensorid_idx on mss_fwevent  (cost=0.00..210202.76 rows=168478 width=12) (actual time=0.35..17.61 rows=320 loops=1)
                                Index Cond: ((evtime > (now() - '06:00'::interval)) AND (sensorid = 7))
                                Filter: (NOT "action")
 Total runtime: 29.09 msec
(11 rows)

--
Sean Chittenden
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
> >> I have an experimental patch lying around somewhere that tries to
> >> work around these problems by offering different estimation methods
> >> for index scans.  If you are interested, I'll dig it out.
> >
> > Sure, I'll take a gander... had my head in enough Knuth recently to
> > even hopefully have some kind of a useful response to the patch.
>
> Sean, the patch is at http://www.pivot.at/pg/16-correlation-732.diff.
> A short description of its usage can be found at
> http://archives.postgresql.org/pgsql-performance/2002-11/msg00256.php.
> If you are interested how the different interpolation methods work,
> read the source - it shouldn't be too hard to find.
>
> You might also want to read the thread starting at
> http://archives.postgresql.org/pgsql-hackers/2002-10/msg00072.php.

Hrm... let me bop back in my archives and reply there... very interesting work though. I hope a reasonable algorithm can be found in time for 7.5, or even 7.4, as this seems to be biting many people and the current algo is clearly not right.

> >> does this mean that sensorid, evtime, and action are not
> >> independent?
> >
> > Hrm... sensorid is a sequence and grows proportionally with evtime,
> > obviously.
>
> So a *low* sensorid (7) is quite uncommon for a *late* evtime?  This
> would help understand the problem.  Unfortunately I have no clue what
> to do about it. :-(

Correct.

> > Having spent a fair amount of time looking at the two following
> > plans, it seems as though an additional statistic is needed to
> > change the cost of doing an index lookup when the index is linearly
> > ordered.
>
> I'm not sure I understand what you mean by "index is linearly
> ordered", but I guess correlation is that statistic you are talking
> about.  However, it is calculated per column, not per index.

If two rows have id's 123456 and 123457, what are the odds that the tuples are going to be on the same page? ie, if 123456 is read, is 123457 already in the OS's or PostgreSQL's disk cache?

> > Whether CLUSTER does this or not, I don't know,
>
> If you CLUSTER on an index and then ANALYSE, you get a correlation of
> 1.0 (== optimum) for the first column of the index.

Correlation of what to what? Of data to nearby data? Of data to related data (ie, a multi-column index)? Of related data to pages on disk? I'm not 100% sure in what context you're using the word correlation... But that value will degrade over time, and at what rate? Does ANALYZE maintain that value so that it's kept accurate? The ANALYZE page was lacking implementation details in terms of how many rows ANALYZE actually scans on big tables, which could dramatically affect the correlation of a table over time if ANALYZE is maintaining the correlation for a column.

> > I never heard back from him after getting the runtime down to a
> > few ms. :-/
>
> Pity!  I'd have liked to see EXPLAIN ANALYSE for
>
> SELECT *
>   FROM mss_fwevent
>  WHERE sensorid = 7
>    AND evtime > (now() - '6 hours'::INTERVAL)
>    AND NOT action;
>
> SELECT *
>   FROM mss_fwevent
>  WHERE sensorid = 7
>    AND evtime > (now() - '6 hours'::INTERVAL);
>
> SELECT *
>   FROM mss_fwevent
>  WHERE evtime > (now() - '6 hours'::INTERVAL);
>
> SELECT *
>   FROM mss_fwevent
>  WHERE sensorid = 7;

Ditto.

> > Are indexes on linearly ordered data rebalanced somehow?  I thought
> > CLUSTER only reordered data on disk. -sc
>
> AFAIK CLUSTER re-creates all indices belonging to the table.

As of 7.3 or 7.4, yes.

-sc

--
Sean Chittenden
Re: [PERFORM] Perfomance Tuning
> > > Well, yeah.  But given the Linux propensity for introducing major
> > > features in "minor" releases (and thereby introducing all the
> > > attendant bugs), I'd think twice about using _any_ Linux feature
> > > until it's been through a major version (e.g. things introduced in
> > > 2.4.x won't really be stable until 2.6.x) -- and even there one is
> > > taking a risk[1].
> >
> > Dudes, seriously - switch to FreeBSD :P
>
> Yeah, it's nice to have a BUG FREE OS huh?  ;^)
>
> And yes, I've used FreeBSD, it's quite good, but I kept getting the
> feeling it wasn't quite done.  Especially the installation
> documentation.

While the handbook isn't the same as reading the actual source, and isn't the only FreeBSD documentation, it certainly is quite good (to the point that publishers see only a small market for FreeBSD books because the documentation provided by the project is so good), IMHO.

http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/

If anyone on this list has any issues with the documentation, please take them up with me _privately_ and I will do my best to either address or correct the problem. Now, back to our regularly scheduled and on-topic programming...

-sc

--
Sean Chittenden
"(PostgreSQL|FreeBSD).org - The Power To Serve"
Re: [PERFORM] Perfomance Tuning
> Andrew Sullivan's fairly regular response is that he tried (albeit
> not VASTLY extensively) to distinguish between disks when working
> with fibre arrays, and he couldn't measure an improvement in
> shifting WAL (the OBVIOUS thing to shift) to separate disks.

Real quick... the faster the drives, the less important it is to move WAL onto a different drive. The slower the drives, the more important this is... which is why this isn't as necessary (if at all) for large production environments.

-sc

--
Sean Chittenden
Re: [PERFORM] The results of my PostgreSQL/filesystem performance tests
> What it still leaves quite open is just what happens when the OS has
> more than one disk drive or CPU to play with.  It's not clear what
> happens in such cases, whether FreeBSD would catch up, or be "left
> further in the dust."  The traditional "propaganda" has been that
> there are all sorts of reasons to expect PostgreSQL on FreeBSD to
> run a bit faster than on Linux; it is a bit unexpected for the
> opposite to seem true.

Let me nip this in the bud before people run away with ideas that aren't correct. When the tests were performed on FreeBSD 5.1 and Linux, the hard drives were running UDMA. When running 4.8, for some reason his drives settled in on PIO mode:

ad0s1a: UDMA ICRC error writing fsbn 1458368 of 729184-729215 (ad0s1 bn 1458368; cn 241 tn 12 sn 44) retrying
ad0s1a: UDMA ICRC error writing fsbn 1458368 of 729184-729215 (ad0s1 bn 1458368; cn 241 tn 12 sn 44) retrying
ad0s1a: UDMA ICRC error writing fsbn 1458368 of 729184-729215 (ad0s1 bn 1458368; cn 241 tn 12 sn 44) retrying
ad0s1a: UDMA ICRC error writing fsbn 1458368 of 729184-729215 (ad0s1 bn 1458368; cn 241 tn 12 sn 44) falling back to PIO mode

The benchmarks were hardly conclusive, as UDMA runs vastly faster than PIO. Until we hear back whether cables were jarred loose between the tests, or whether something else changed, I'd hardly consider these conclusive tests, given that PIO vs. UDMA is apples to oranges in terms of speed. I fully expect that FreeBSD 4.8 will perform faster than 5.1 (5.x is still being unwound from Giant), and it should outperform Linux as well if industry experience is any indicator.

-sc

--
Sean Chittenden
Re: [PERFORM] The results of my PostgreSQL/filesystem performance tests
> I need to step in and do 2 things:

Thanks for posting that. Let me know if you have any questions while doing your testing. I've found that using 16K blocks on FreeBSD results in about an 8% speedup in writes to the database, fwiw. I'm likely going to make this the default for PostgreSQL on FreeBSD starting with 7.4 (just posted something to -hackers about this). If you'd like to do this in your testing, just apply the following patch.

Right now PostgreSQL defaults to 8K blocks, but FreeBSD uses 16K blocks, which means that currently, reading two blocks of data in PG is two read calls to the OS: one reads 16K of data off disk and returns the 1st page, and the 2nd call pulls the 2nd block from the FS cache. Making things 16K avoids the need for the 2nd system call, which is where the performance difference is coming from, afaict.

-sc

--
Sean Chittenden

Index: src/include/pg_config_manual.h
===================================================================
RCS file: /home/ncvs/pgsql/pgsql-server/src/include/pg_config_manual.h,v
retrieving revision 1.5
diff -u -r1.5 pg_config_manual.h
--- src/include/pg_config_manual.h      4 Aug 2003 00:43:29 -0000      1.5
+++ src/include/pg_config_manual.h      27 Aug 2003 17:40:12 -0000
@@ -23,7 +23,7 @@
  *
  * Changing BLCKSZ requires an initdb.
  */
-#define BLCKSZ 8192
+#define BLCKSZ 16384
 
 /*
  * RELSEG_SIZE is the maximum number of blocks allowed in one disk
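For anyone who wants to see the syscall half of that claim, here's a trivial counter (my own illustration, not from the patch). Run it as `blockread 8192 somefile` and again as `blockread 16384 somefile` and compare the number of read(2) calls:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc != 3) { fprintf(stderr, "usage: %s blocksize file\n", argv[0]); return 1; }
    size_t bs = (size_t) atoi(argv[1]);
    char *buf = malloc(bs);
    int fd = open(argv[2], O_RDONLY);
    if (buf == NULL || fd == -1) { perror("setup"); return 1; }

    long calls = 0;
    ssize_t n;
    while ((n = read(fd, buf, bs)) > 0)  /* one syscall per block */
        calls++;
    printf("%zu-byte blocks: %ld read(2) calls\n", bs, calls);

    close(fd);
    free(buf);
    return 0;
}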
Re: [PERFORM] The results of my PostgreSQL/filesystem performance tests
> >> I need to step in and do 2 things:
>
> SC> Thanks for posting that.  Let me know if you have any questions
> SC> while doing your testing.  I've found that using 16K blocks on
> SC> FreeBSD results in about an 8% speedup in writes to the database,
> SC> fwiw.
>
> ok.. ignore my prior request about how to set that... i missed you
> had included a patch.
>
> Any recommendations on newfs parameters for an overly large file
> system used solely for Postgres?  Over 100Gb (with raid 10) or over
> 200Gb (with raid 5)?

Nope, you'll have to test and see. If you find something that works, however, let me know.

-sc

--
Sean Chittenden
Re: [PERFORM] bad estimates
> > If you want both the max and the min, then things are going to be a
> > bit more work.  You are either going to want to do two separate
> > selects or join two selects or use subselects.  If there aren't
> > enough prices per stock, the sequential scan might be fastest since
> > you only need to go through the table once and don't have to hit
> > the index blocks.
> >
> > It is still odd that you didn't get a big speed up for just the min
> > though.
>
> I found I'm suffering from an effect detailed in a previous thread
> titled
>
>     Does "correlation" mislead the optimizer on large tables?

I don't know about large tables, but this is a big problem and something I'm going to spend some time validating later today. I think Manfred's patch is pretty good and certainly better than where we are, but I haven't used it yet to see if it's the magic ticket for many of these index problems.

-sc

--
Sean Chittenden
Re: [PERFORM] bad estimates
> >> I found I'm suffering from an effect detailed in a previous thread
> >> titled
> >>
> >>     Does "correlation" mislead the optimizer on large tables?
> >
> > I don't know about large tables, but this is a big problem and
> > something I'm going to spend some time validating later today.  I
> > think Manfred's patch is pretty good and certainly better than where
> > we are but I haven't used it yet to see if it's the magic ticket for
> > many of these index problems.
>
> I had to dig through a lot of archives to find this.  Is this the
> patch, from last October?
>
> http://members.aon.at/pivot/pg/16-correlation.diff
>
> If so, I'll try it out and report my results.

Same guy, but that patch is pretty out of date and has been replaced by some newer work that's much better:

From: Manfred Koizar <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
Subject: Re: [HACKERS] Correlation in cost_index()
Date: Wed, 20 Aug 2003 19:57:12 +0200
Message-ID: <[EMAIL PROTECTED]>

and

From: Manfred Koizar <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: [HACKERS] Again on index correlation
Date: Wed, 20 Aug 2003 21:21:14 +0200
Message-ID: <[EMAIL PROTECTED]>

-sc

--
Sean Chittenden
Re: [PERFORM] FreeBSD page size
> Ok... simple tests have completed.  Here are some numbers.
>
> FreeBSD 4.8
> PG 7.4b2
> 4GB Ram
> Dual Xeon 2.4GHz processors
> 14 U320 SCSI disks attached to Dell PERC3/DC RAID controller in RAID 5
> config with 32k stripe size

[snip]

> Then I took the suggestion to update PG's page size to 16k and did the
> same increase on sort_mem and checkpoint_segments as above.  I also
> halved the shared_buffers and max_fsm_pages (probably should have
> halved the effective_cache_size too...)
>
> restore time: 11322 seconds
> vacuum analyze time: 27 minutes
> select count(*) from user_list where owner_id=315;  48267.66 ms
>
> Granted, given this simple test it is hard to say whether the 16k
> blocks will make an improvement under live load, but I'm gonna give it
> a shot.  The 16k block size shows me roughly 2-6% improvement on these
> tests.
>
> So throw in my vote for 16k blocks on FreeBSD (and annotate the docs
> to tell which parameters need to be halved to account for it).

I haven't had a chance to run any tests yet (ELIFE), but there was a suggestion that 32K blocks are an even better performer than 16K blocks (!!??!!??). I'm not sure why this would be, and my only guess is that it relies more heavily on the disk cache to ease IO. Since you have the hardware set up, Vivek, would it be possible for you to run a test with 32K blocks?

I've started writing a threaded benchmarking program called pg_crush that I hope to post here in a few days. It times connection startup, INSERTs, DELETEs, UPDATEs, and both sequential scans as well as index scans for random and sequentially ordered tuples. It's similar to pgbench, except it generates its own data, uses pthreads (cheers on KSE!), and returns more fine-grained timing information for the various activities.

-sc

--
Sean Chittenden
Re: [PERFORM] PostgreSQL Reliability when fsync = false on Linux-XFS
> Just wonderin'.  What if you symlink WAL to a directory which is on a
> mounted USB RAM drive?

USB 2.0 you mean? It supposedly runs at 1394 speeds, but USB 1.0/1.1 runs at 1MB/s under ideal circumstances... that's slower than even old IDE drives.

> Will that increase any throughput?

Probably not...

> I am sure a 256/512MB flash drive will cost lot less than a SCSI
> disk.  May be even a GB on flash drive would do..

That's true... but on a $$/MB basis, you're better off investing in RAM and increasing your effective_cache_size. If dd to a flash card is faster than to an IDE drive, please let me know. :)

-sc

--
Sean Chittenden
UNIX(TM), a BSD like Operating System
Re: [PERFORM] Performance issue
> My statistics (Athlon 1.8GHz):
>
> 20,000 items    takes on average 0.078 seconds/room
> 385,000 items   takes on average 0.11 seconds/room
> 690,000 items   takes on average 0.270 seconds/room
> 1,028,000 items takes on average 0.475 seconds/room

[snip]

> I am running Red Hat 8. Some of my conf entries that I have changed
> follow:
>
> shared_buffers = 3700
> effective_cache_size = 4000
> sort_mem = 32168

Have you twiddled with your wal_buffers or checkpoint_segments? Might be something to look at.

-sc

--
Sean Chittenden

---(end of broadcast)---
TIP 8: explain analyze is your friend
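For what it's worth, the sort of conf entries I'd try as a first pass -- illustrative starting points only, not a recommendation tuned to this workload:

    wal_buffers = 16          # the default of 8 is tiny for write-heavy loads
    checkpoint_segments = 16  # fewer, less frequent checkpoint storms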
Re: [PERFORM] count(*) slow on large tables
> How will it help? This is in addition to the trigger proposal that
> came up earlier. With triggers it's not possible to make values
> visible across backends unless the trigger updates a table, which
> eventually leads to the vacuum/dead tuples problem.
>
> 1. User creates a trigger to check updates/inserts for certain conditions.
> 2. It updates the count as and when required.
> 3. If the trigger detects the count is not initialized, it would issue the
>    same query the first time. There is no avoiding this issue.
>
> Besides, the facility of resident variables could be used
> imaginatively as well.
>
> Does this make sense? IMO this is a more generalised approach overall.

I do this _VERY_ frequently in my databases, only I have my stored procs do the aggregate in a predefined MVCC table that's always there. Here's a denormalized version for public consumption/thought:

CREATE TABLE global.dba_aggregate_cache (
  dbl        TEXT NOT NULL, -- The database location, doesn't need to be
                            -- qualified (ex: schema.table.col)
  op         TEXT NOT NULL, -- The operation: SUM, COUNT, etc.
  qual       TEXT,          -- Any kind of conditional, such as a WHERE clause
  val_int    INT,           -- Whatever the value is, of type INT
  val_bigint BIGINT,        -- Whatever the value is, of type BIGINT
  val_text   TEXT,          -- Whatever the value is, of type TEXT
  val_bytea  BYTEA          -- Whatever the value is, of type BYTEA
);

CREATE UNIQUE INDEX dba_aggregate_cache_dbl_op_udx
    ON global.dba_aggregate_cache(dbl, op);

Then I use a function to retrieve this value instead of a SELECT COUNT(*):

SELECT public.cache_count('dbl', 'qual'); -- In this case, the op is COUNT
SELECT public.cache_count('dbl');         -- Returns the COUNT for the table
                                          -- listed in the dbl

Then I create four or five functions (depends on the op I'm performing):

1) A private function that _doesn't_ run as security definer, which
   populates the global.dba_aggregate_cache row if it's empty.

2) A STABLE function for SELECTs; if the row doesn't exist, it calls
   function #1 to populate it.

3) A function for INSERTs; if the row doesn't exist, it calls function #1
   to populate it, then adds the necessary bits to keep the value accurate.

4) A function for DELETEs; if the row doesn't exist, it calls function #1
   to populate it, then subtracts the necessary bits to keep the value
   accurate.

5) A function for UPDATEs; if the row doesn't exist, it calls function #1
   to populate it, then updates the necessary bits to keep the value
   accurate. It's not uncommon for me to skip the UPDATE function/trigger.

(Unlike #2, functions 3-5 modify the cache row, so they can't be declared STABLE.) Create triggers for functions 3-5 (the SELECT path is just a function call), and test away. It's MVCC, and searching an INDEX'ed table for a single row is obviously vastly faster than a seqscan/aggregate. If I need any kind of aggregate to be fast, I use this system with a variation of the above table. The downside is that I have to retrain others to use cache_count(), or some other function, instead of COUNT(*).

That said, it'd be nice if there were a way to tell PostgreSQL to do the above for you and teach COUNT(*), SUM(*), or other aggregates to use an MVCC-backed cache similar to the above. If people want their COUNTs to be fast, then they have to live with the INSERT, UPDATE, DELETE cost. The above doesn't work with anything complex such as joins, but it's certainly a start, and I think it satisfies everyone's gripes other than the tuple churn that _does_ happen (*nudge nudge*, pg_autovacuum could be integrated into the backend to handle this). A minimal sketch of the SELECT-side function follows.
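This is only a sketch of what function #2 might look like -- the names, the error handling, and the single-argument form are assumptions, not my production code:

CREATE FUNCTION public.cache_count(TEXT) RETURNS INT AS '
DECLARE
    v INT;
BEGIN
    -- Look up the cached COUNT for the given database location.
    SELECT val_int INTO v
        FROM global.dba_aggregate_cache
        WHERE dbl = $1 AND op = ''COUNT'';
    IF NOT FOUND THEN
        -- In the real thing, function #1 would populate the row here.
        RAISE EXCEPTION ''cache row for % not populated'', $1;
    END IF;
    RETURN v;
END;
' LANGUAGE 'plpgsql' STABLE;

-- Usage, per the examples above:
-- SELECT public.cache_count('public.user_list');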
For those worried about performance: the pages that are constantly being recycled would likely stay in the disk cache (PG's or the OS's). There's still some commit overhead, but still... no need to over-optimize by requiring the table to be stored in the outdated, slow, and overused shm (also, *nudge nudge*). Anyway, let me throw that out there as a solution that I use, and it works quite well. I didn't explain the use of the qual column, but I think those who grasp the above way of handling things probably grok how to use the qual column in a dynamically executed query. CREATE AGGREGATE CACHE, anyone?

-sc

--
Sean Chittenden

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] go for a script! / ex: PostgreSQL vs. MySQL
> Yeah, I had a similar thought to Oliver's and suspected that this
> would be the answer. Also, while it's not too hard to do this for a
> single platform, it gets complicated once you start looking at
> different ones.
>
> Josh, let me know when you're ready to do this. I'll try to help,
> although my perl's kind of rusty. Also, can you even assume perl for
> a postgres install? Does Solaris, for instance, come with perl?

Um, why not wait until the C version of initdb is committed, then stake out a section that'll allow us to submit patches to have initdb autotune to our hearts' content? There's a tad bit of precedent in having shared buffers automatically set by initdb, so why not continue with it? I know that under FreeBSD, initdb will have some #ifdef's wrapped around the sysctl() syscall to get info about kernel bits.

Talking about how to handle this gracefully for a gazillion different platforms might be a more useful discussion at this point, because I'm sure people will be able to contribute the necessary patches to extract info from their native OS so that initdb can make useful decisions. Or, lastly, does anyone think that this should be in a different, external program?

-sc

--
Sean Chittenden

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
       http://www.postgresql.org/docs/faqs/FAQ.html
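On FreeBSD, the kernel bits in question are a sysctl away, which is roughly what the #ifdef'ed C would read (these two MIBs exist on 4.x; other OSes will need their own equivalents):

    # Total physical memory, in bytes:
    sysctl -n hw.physmem
    # Kernel address space used for buffer/IO caching, in bytes:
    sysctl -n vfs.hibufspace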
Re: [PERFORM] go for a script! / ex: PostgreSQL vs. MySQL
> NB> So far:
>
> NB> shared_buffers = 1/16th of total memory
> NB> effective_cache_size = 80% of the supposed kernel cache.
>
> Please take into account the blocksize compiled into PG, too...

Would anyone object to a patch that exports the blocksize via a read-only GUC? Too many tunables are page-dependent, which is infuriating when copying configs from DB to DB. I also wish pgsql had some notion of percentages for values that end with a '%'.

-sc

--
Sean Chittenden

---(end of broadcast)---
TIP 8: explain analyze is your friend
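Neither of these exists today, so treat both lines below as a proposal sketch, not working syntax:

    SHOW block_size;  -- proposed read-only GUC exporting the compiled blocksize
    -- and, in postgresql.conf, percentage notation might look like:
    -- effective_cache_size = 25%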
Re: [PERFORM] go for a script! / ex: PostgreSQL vs. MySQL
> >NB> shared_buffers = 1/16th of total memory
> >NB> effective_cache_size = 80% of the supposed kernel cache.
>
> I think Sean(?) mentioned this one for FreeBSD (Bash code):

sh, not bash. :)

> echo "effective_cache_size = $((`sysctl -n vfs.hibufspace` / 8192))"
>
> I've used it for my dedicated servers. Is this calculation correct?

Yes, or it's real close at least. vfs.hibufspace is the amount of kernel space that's used for caching IO operations (minus the necessary space taken by the kernel itself). If you're real paranoid, you could do some kernel profiling, figure out how much of the cache is actually disk IO, and multiply the above by some percentage, say 80%. I haven't found it necessary to do so yet. Since hibufspace is all IO caching (caching net activity is kinda pointless), I assume 100% of it is used as a disk cache and don't use a multiplier. The 8192, however, is the size of a PG page, so if you tweak PG's page size, you have to change this constant (*grumbles*).

-sc

--
Sean Chittenden

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
       (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
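Factoring the constant out makes the page-size dependency explicit; same calculation, with the block size as a knob (set it to whatever your PG was compiled with):

    BLCKSZ=8192  # 16384 or 32768 if you've rebuilt PG with bigger pages
    echo "effective_cache_size = $((`sysctl -n vfs.hibufspace` / BLCKSZ))"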
Re: [PERFORM] go for a script! / ex: PostgreSQL vs. MySQL
> >> echo "effective_cache_size = $((`sysctl -n vfs.hibufspace` / 8192))" > >> > >> I've used it for my dedicated servers. Is this calculation correct? > > SC> Yes, or it's real close at least. vfs.hibufspace is the amount > of SC> kernel space that's used for caching IO operations (minus the > > I'm just curious if anyone has a tip to increase the amount of > memory FreeBSD will use for the cache? Recompile your kernel with BKVASIZE set to 4 times its current value and double your nbuf size. According to Bruce Evans: "Actually there is a way: the vfs_maxbufspace gives the amount of space reserved for buffer kva (= nbuf * BKVASIZE). nbuf is easy to recover from this, and the buffer kva space may be what is wanted anyway." [snip] "I've never found setting nbuf useful, however. I want most parametrized sizes including nbuf to scale with resource sizes, and it's only with RAM sizes of similar sizes to the total virtual address size that its hard to get things to fit. I haven't hit this problem myself since my largest machine has only 1GB. I use an nbuf of something like twice the default one, and a BKVASIZE of 4 times the default. vfs.maxbufspace ends up at 445MB on the machine with 1GB, so it is maxed out now." YMMV. -sc -- Sean Chittenden ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] go for a script! / ex: PostgreSQL vs. MySQL
> > > PostgreSQL requires some more shared memory to cache some
> > > tables, x Mb. Do you want to increase your OS kernel parameters?
> > >
> > > Tweak shmmax and shmall
> >
> > Note that this still requires a kernel recompile on FreeBSD :(
>
> Not our fault, now is it? This would mean that we wouldn't be able
> to script for FreeBSD. Bug the FreeBSD developers.

And if you do so, you're going to hear that shm* is an antiquated interface that's slow, inefficient, and shouldn't be used. :) Every few months one of the uber core BSD hackers threatens to rewrite that part of PG because, high up in the BSD camp, it's common belief that shm* is a source of performance loss for PostgreSQL. One of these days it'll happen, probably with mmap(2) mapping MAP_SHARED files stored in a $PGDATA/data/shared dir, as mmap() is far and away the fastest shared memory mechanism and is very widely deployed (I would be surprised if any of the supported PG platforms didn't have mmap()).

-sc

--
Sean Chittenden

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
       (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
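Until then, the recompile itself is small. The relevant kernel config knobs look something like this -- the SHMMAXPGS value is illustrative only (it's counted in 4K pages, so 65536 = 256MB):

    options  SYSVSHM            # SysV shared memory support
    options  SHMMAXPGS=65536    # max shared segment size, in 4K pages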
Re: [PERFORM] go for a script! / ex: PostgreSQL vs. MySQL
> >> This would be parameters such as the block size and a few
> >> other compile-time parameters. If we can get to some of these
> >> read-only parameters then that would make this step easier,
> >> certainly for the new recruits amongst us.
> >
> > Actually, from my perspective, we shouldn't bother with this; if an admin
> > knows enough to set an alternate block size for PG, then they know
> > enough to tweak the conf file by hand. I think we should just issue a
> > warning that this script:
> > 1) does not work for anyone who is using non-default block sizes,
>
> There was some talk, either on this list or freebsd-performance,
> about setting the default block size for PostgreSQL running on
> FreeBSD to be 16k because of performance reasons. That is: *default*
> for the port, user is not asked.

Real quick: this isn't true. The block size is tunable, but the default doesn't change. You can set PGBLOCKSIZE to the values "16K" or "32K" to change the block size, but the default remains 8K.

http://lists.freebsd.org/pipermail/freebsd-database/2003-October/000111.html

-sc

--
Sean Chittenden

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
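A sketch of how the knob is used when building from ports -- the port path and exact invocation are assumptions, so check the port's Makefile:

    cd /usr/ports/databases/postgresql7
    make PGBLOCKSIZE=16K install clean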
Re: [PERFORM] PostgreSQL 7.4beta5 vs MySQL 4.0.16 with RT(DBIx::SearchBuilder)
> So it's not just PostgreSQL that is suffering from the bad SQL, but
> MySQL also. But the question is, why does PostgreSQL suffer so
> badly? I think not all developers write very nice SQL.
>
> It's really sad to see a fine piece of work (RT) performing
> sub-optimally because of malformed SQL. [especially on the database
> of my choice ;-)]

Post EXPLAIN ANALYZE output for the queries you're running; then maybe you'll be able to get some useful help from this list. Until then, it's very hard to speculate as to why PostgreSQL is slower.

-sc

--
Sean Chittenden

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
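For the archives, that means the full output of something like this for each offending query (the table and qual here are made up):

    EXPLAIN ANALYZE
    SELECT t.id, t.subject
      FROM tickets t
     WHERE t.owner = 315;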