[PERFORM] Fragmentation of WAL files
I was recently running defrag on my windows/parallels VM and noticed a bunch of WAL files that defrag couldn't take care of, presumably because the database was running. What's disturbing to me is that these files all had ~2000 fragments. Now, this was an EnterpriseDB database which means the WAL files were 64MB instead of 16MB, but even having 500 fragments for a 16MB WAL file seems like it would definitely impact performance.

Can anyone else confirm this? I don't know if this is a windows-only issue, but I don't know of a way to check fragmentation in unix.

--
Jim Nasby <[EMAIL PROTECTED]>   EnterpriseDB   http://enterprisedb.com   512.569.9461 (cell)
Re: [PERFORM] seeking advise on char vs text or varchar in search table
On Apr 23, 2007, at 7:16 AM, Merlin Moncure wrote:
> On 4/20/07, chrisj <[EMAIL PROTECTED]> wrote:
>> I have a table that contains a column for keywords that I expect to become
>> quite large and will be used for web searches. I will either index the
>> column or come up with a simple hashing algorithm, add the hash key to the
>> table and index that column.
>>
>> I am thinking the max length in the keyword column I need to support is 30,
>> but the average would be less than 10. Any suggestions on whether to use
>> char(30), varchar(30) or text would be appreciated. I am looking for the
>> best performance option, not necessarily the most economical on disk.
>
> Don't use char... it always pads out the string to the declared length. It
> also has no real advantage over varchar in any practical situation. Think
> of varchar as text with a maximum length... it's no faster or slower, but
> the database will throw out entries based on length (which can be a good or
> a bad thing)... in this case, text feels better.

AIUI, char, varchar and text all store their data in *exactly* the same way in the database; char only pads data on output, and in the actual tables it still contains the regular varlena header. The only reason I've ever used char in other databases is to save the overhead of the variable-length information, so I recommend people just steer clear of char in PostgreSQL.

--
Jim Nasby <[EMAIL PROTECTED]>   EnterpriseDB   http://enterprisedb.com   512.569.9461 (cell)
Re: [PERFORM] What`s wrong with JFS configuration?
On Apr 25, 2007, at 8:51 AM, Paweł Gruszczyński wrote:
> where u6 stores Fedora Core 6 operating system, and u0 stores 3 partitions
> with ext2, ext3 and jfs filesystem.

Keep in mind that drives have a faster data transfer rate at the outer edge than they do at the inner edge, so if you've got all 3 filesystems sitting on that array at the same time it's not a fair test. I heard numbers on the impact of this a *long* time ago and I think it was in the 10% range, but I could be remembering wrong.

You'll need to drop each filesystem and create the next one to get a fair comparison.

--
Jim Nasby <[EMAIL PROTECTED]>   EnterpriseDB   http://enterprisedb.com   512.569.9461 (cell)
Re: [PERFORM] What`s wrong with JFS configuration?
The outer track / inner track performance ratio is more like 40 percent. A recent example is 78MB/s outer and 44MB/s inner for the new Seagate 750GB drive (see http://www.storagereview.com for benchmark results).

- Luke

Msg is shrt cuz m on ma treo

-----Original Message-----
From: Jim Nasby [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 26, 2007 03:53 AM Eastern Standard Time
To: Pawel Gruszczynski
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] What`s wrong with JFS configuration?

> Keep in mind that drives have a faster data transfer rate at the outer
> edge than they do at the inner edge, so if you've got all 3 filesystems
> sitting on that array at the same time it's not a fair test. [snip]
Re: [PERFORM] seeking advise on char vs text or varchar in search table
"Jim Nasby" <[EMAIL PROTECTED]> writes: > AIUI, char, varchar and text all store their data in *exactly* the same way in > the database; char only pads data on output, and in the actual tables it > still > contains the regular varlena header. The only reason I've ever used char in > other databases is to save the overhead of the variable-length information, > so > I recommend to people to just steer clear of char in PostgreSQL. Everything you said is correct except that char actually pads its data on input, not output. This doesn't actually make a lot of sense since we're storing it as a varlena so we could pad it on output and modify the data type functions to pretend the spaces are there without storing them. However it would only make a difference if you're storing variable length data in a char field in which case I would 100% agree with your conclusion and strongly recommend using varchar. The only reason I would think of using char is when the data should always be the same length, like a SSN or md5hash or something like that. In which case it's purely for the self-documenting notational convenience, not any performance reason. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] What`s wrong with JFS configuration?
Jim Nasby wrote:
> On Apr 25, 2007, at 8:51 AM, Paweł Gruszczyński wrote:
>> where u6 stores Fedora Core 6 operating system, and u0 stores 3 partitions
>> with ext2, ext3 and jfs filesystem.
> Keep in mind that drives have a faster data transfer rate at the outer edge
> than they do at the inner edge [...]

I've been wondering from time to time whether a partition's position can be a (probably modest, of course) performance factor. If I create a partition at the beginning or end of the disk, is it going to end up at a predictable physical position on the platters? I remember having heard that every manufacturer has its own allocation logic. Has anyone got some information, just for curiosity?

--
Cosimo
[PERFORM] Fw: PostgreSQL Performance Tuning
Dear all,

We are facing a performance tuning problem while using a PostgreSQL database over the network on a Linux OS. Our database consists of more than 500 tables with an average of 10K records per table, and an average of 20 users accessing the database simultaneously over the network. Each table has indexes and we are querying the database using Hibernate.

The biggest problem is with insertion, updating and fetching of records, i.e. the database performance is very slow. It takes a long time to respond in the above scenario.

Please provide me with tuning advice for the database. I am attaching my postgresql.conf file for reference to our current configuration.

Please reply to me ASAP.

Regards,
Shohab Abdullah
Software Engineer, Manufacturing SBU-POWAI
Larsen and Toubro Infotech Ltd. | 4th floor, L&T Technology Centre, Saki Vihar Road, Powai, Mumbai-400072
+91-22-67767366 | +91-9870247322
Visit us at: http://www.lntinfotech.com

Attachment: postgresql.conf
Re: [PERFORM] Fw: PostgreSQL Performance Tuning
Please try to keep postings to one mailing list - I've replied to the performance list here.

Shohab Abdullah wrote:
> We are facing a performance tuning problem while using a PostgreSQL database
> over the network on a Linux OS. Our database consists of more than 500 tables
> with an average of 10K records per table, and an average of 20 users accessing
> the database simultaneously over the network. Each table has indexes and we
> are querying the database using Hibernate. [snip]
> Please provide me with tuning advice for the database. I am attaching my
> postgresql.conf file for reference to our current configuration.

You haven't provided any details on what version of PG you are using, what hardware you are using, whether there is a specific bottleneck (disk, memory, cpu) or certain queries. Without that, no-one can suggest useful settings.

You might find this document a good place to start:
http://www.powerpostgresql.com/PerfList/

HTH

--
Richard Huxton   Archonet Ltd
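As a starting point for the details Richard asks for, most of them can be pulled straight from psql; this is only a suggested checklist, nothing official:

    SELECT version();            -- PostgreSQL version and build
    SHOW shared_buffers;         -- main buffer cache size
    SHOW work_mem;               -- per-sort/per-hash memory (called sort_mem before 8.0)
    SHOW effective_cache_size;   -- the planner's idea of the OS cache
    SHOW max_connections;

Hardware details (RAM, disk layout) and EXPLAIN ANALYZE output for a couple of the slow statements round out the picture.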
Re: [PERFORM] [GENERAL] Fw: PostgreSQL Performance Tuning
Please try to post to one list at a time. I've replied to this on the -performance list.

--
Richard Huxton   Archonet Ltd
Re: [PERFORM] Fragmentation of WAL files
In response to Jim Nasby <[EMAIL PROTECTED]>:
> I was recently running defrag on my windows/parallels VM and noticed
> a bunch of WAL files that defrag couldn't take care of, presumably
> because the database was running. What's disturbing to me is that
> these files all had ~2000 fragments. Now, this was an EnterpriseDB
> database which means the WAL files were 64MB instead of 16MB, but
> even having 500 fragments for a 16MB WAL file seems like it would
> definitely impact performance.

I don't know about that. I've seen marketing material that claims that modern NTFS doesn't suffer performance problems from fragmentation. I've never tested it myself, but my point is that you might want to do some experiments -- you might find out that it doesn't make any difference.

If it does, you should be able to stop the DB, defragment the files, then start the DB back up. Since WAL files are recycled, they shouldn't fragment again -- unless I'm missing something.

If that works, it may indicate that (on Windows) a good method for installing is to create all the necessary WAL files as empty files before launching the DB.

> Can anyone else confirm this? I don't know if this is a windows-only
> issue, but I don't know of a way to check fragmentation in unix.

I can confirm that it's only a Windows problem. No UNIX filesystem that I'm aware of suffers from fragmentation.

--
Bill Moran, Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/   [EMAIL PROTECTED]   Phone: 412-422-3463x4023
Re: [PERFORM] Fragmentation of WAL files
Bill Moran wrote:
> In response to Jim Nasby <[EMAIL PROTECTED]>:
>> I was recently running defrag on my windows/parallels VM and noticed
>> a bunch of WAL files that defrag couldn't take care of [snip]
>
> I don't know about that. I've seen marketing material that claims that
> modern NTFS doesn't suffer performance problems from fragmentation. I've
> never tested it myself, but my point is that you might want to do some
> experiments -- you might find out that it doesn't make any difference.
>
> If it does, you should be able to stop the DB, defragment the files, then
> start the DB back up. Since WAL files are recycled, they shouldn't fragment
> again -- unless I'm missing something.
>
> If that works, it may indicate that (on Windows) a good method for
> installing is to create all the necessary WAL files as empty files before
> launching the DB.

If that turns out to be a problem, I wonder if it would help to expand the WAL file to full size with ftruncate or something similar, instead of growing it page by page.

>> Can anyone else confirm this? I don't know if this is a windows-only
>> issue, but I don't know of a way to check fragmentation in unix.
>
> I can confirm that it's only a Windows problem. No UNIX filesystem
> that I'm aware of suffers from fragmentation.

What do you mean by suffering? All filesystems fragment files at some point. When and how differs from filesystem to filesystem. And some filesystems might be smarter than others in placing the fragments.

There's a tool for Linux in the e2fsprogs package called filefrag that shows the fragmentation of a file, but I've never used it myself.

--
Heikki Linnakangas   EnterpriseDB   http://www.enterprisedb.com
Filesystem fragmentation (Re: [PERFORM] Fragmentation of WAL files)
In response to Heikki Linnakangas <[EMAIL PROTECTED]>:
[snip]
>>> Can anyone else confirm this? I don't know if this is a windows-only
>>> issue, but I don't know of a way to check fragmentation in unix.
>>
>> I can confirm that it's only a Windows problem. No UNIX filesystem
>> that I'm aware of suffers from fragmentation.
>
> What do you mean by suffering? All filesystems fragment files at some
> point. When and how differs from filesystem to filesystem. And some
> filesystems might be smarter than others in placing the fragments.

To clarify my viewpoint: to my knowledge, there is no Unix filesystem that _suffers_ from fragmentation. Specifically, all filesystems accumulate some degree of fragmentation, but every Unix filesystem that I am aware of has built-in mechanisms to mitigate this and prevent it from becoming a performance issue.

> There's a tool for Linux in the e2fsprogs package called filefrag that
> shows the fragmentation of a file, but I've never used it myself.

Interesting. However, the existence of a tool does not particularly indicate the _need_ for said tool. It might just have been something cool that somebody wrote.

--
Bill Moran, Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/   [EMAIL PROTECTED]   Phone: 412-422-3463x4023
Re: [PERFORM] Fragmentation of WAL files
> In response to Jim Nasby <[EMAIL PROTECTED]>:
>> I was recently running defrag on my windows/parallels VM and noticed
>> a bunch of WAL files that defrag couldn't take care of, presumably
>> because the database was running. What's disturbing to me is that
>> these files all had ~2000 fragments.

It sounds like that filesystem is too stupid to coalesce successive write() calls into one allocation fragment :-(. I agree with the comments that this might not be important, but you could experiment to see --- try increasing the size of "zbuffer" in XLogFileInit to maybe 16*XLOG_BLCKSZ, re-initdb, and see if performance improves.

The suggestion to use ftruncate is so full of holes that I won't bother to point them all out, but certainly we could write more than just XLOG_BLCKSZ at a time while preparing the file.

			regards, tom lane
Re: Filesystem fragmentation (Re: [PERFORM] Fragmentation of WAL files)
Bill Moran wrote:
> In response to Heikki Linnakangas <[EMAIL PROTECTED]>:
>> What do you mean by suffering? All filesystems fragment files at some
>> point. When and how differs from filesystem to filesystem. [snip]
>
> To clarify my viewpoint: to my knowledge, there is no Unix filesystem that
> _suffers_ from fragmentation. Specifically, all filesystems accumulate some
> degree of fragmentation, but every Unix filesystem that I am aware of has
> built-in mechanisms to mitigate this and prevent it from becoming a
> performance issue.

More specifically, this problem was solved on UNIX file systems way back in the 1970's and 1980's. No UNIX file system (including Linux) since then has had significant fragmentation problems, unless the file system gets close to 100% full. If you run below 90% full, fragmentation shouldn't ever be a significant performance problem.

The word "fragmentation" would have dropped from the common parlance if it weren't for MS Windoz.

Craig
Re: Filesystem fragmentation (Re: [PERFORM] Fragmentation of WAL files)
"Craig A. James" <[EMAIL PROTECTED]> writes: > More specifically, this problem was solved on UNIX file systems way back in > the > 1970's and 1980's. No UNIX file system (including Linux) since then has had > significant fragmentation problems, unless the file system gets close to 100% > full. If you run below 90% full, fragmentation shouldn't ever be a significant > performance problem. Note that the main technique used to avoid fragmentation -- paradoxically -- is to break the file up into reasonable sized chunks. This allows the filesystem the flexibility to place the chunks efficiently. In the case of a performance-critical file like the WAL that's always read sequentially it may be to our advantage to defeat this technique and force it to be allocated sequentially. I'm not sure whether any filesystems provide any option to do so. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] [GENERAL] Fw: PostgreSQL Performance Tuning
Hello!

I would do the following (in that order):

1.) Check for performant application logic and application design (e.g. the degree of granularity of the Java Hibernate mapping, whether there are object iterators with hundreds of objects, etc.)
2.) Check the Hibernate-generated queries and whether each query is suitable or not. Also do an EXPLAIN to see the query plan. Sometimes a manually written query is much more efficient than the Hibernate-generated one.
3.) Optimize the database itself, e.g. the PostgreSQL configuration.

Ciao,
Gerhard

--
http://www.wiesinger.com/

On Thu, 26 Apr 2007, Shohab Abdullah wrote:
> We are facing a performance tuning problem while using a PostgreSQL database
> over the network on a Linux OS. Our database consists of more than 500 tables
> with an average of 10K records per table, and an average of 20 users accessing
> the database simultaneously over the network. Each table has indexes and we
> are querying the database using Hibernate. The biggest problem is with
> insertion, updating and fetching of records, i.e. the database performance is
> very slow. [snip]
Re: Filesystem fragmentation (Re: [PERFORM] Fragmentation of WAL files)
Gregory Stark <[EMAIL PROTECTED]> writes:
> In the case of a performance-critical file like the WAL, which is always
> read sequentially, it may be to our advantage to defeat this technique and
> force it to be allocated sequentially. I'm not sure whether any filesystems
> provide an option to do so.

We more or less do that already by filling the entire file in one go when it's created ...

			regards, tom lane
Re: [PERFORM] not using indexes on large table
On Sat, 2007-04-21 at 15:17, Jeroen Kleijer wrote:
> Hi all,
>
> I'm a bit new to PostgreSQL and database design in general so forgive me
> for asking stupid questions. ;-)
>
> I've setup a PostgreSQL database on a Linux machine (2 processor, 1GB
> mem) and while the database itself resides on a NetApp filer, via NFS,
> this doesn't seem to impact the performance too drastically.

What does a benchmark like bonnie++ say about your performance? And I hope your data's not too important to you, because I've had LOTS of problems with NFS mounts in the past with pgsql. Generally speaking, NFS can be moderately fast, or moderately reliable (for databases), but it generally isn't both at the same time.

Considering the cost of a quartet of 80 Gig SATA drives ($59 x 4) and a decent RAID controller (LSI, Areca at ~$450 or so) you could be getting VERY good performance out of your system with real reliability at the same time on a RAID-10 volume. Then use the NetApp for backup. That's what I'd do.

> I basically use it for indexed tables without any relation between 'em.
> So far this has worked perfectly.
>
> For statistics I've created the following table:
>   volume varchar(30),
>   qtree varchar(255),
>   file varchar(512),
>   ctime timestamp,
>   mtime timestamp,
>   atime timestamp
> and created separate indexes on the volume and qtree columns.

You might want to look at setting this up as two or three tables, with a view and update triggers to make them look like one table to the user, and with qtree and file in their own tables. That would make your main stats table only one varchar(30) and three timestamps wide -- especially worthwhile if qtree and file tend to be large. If one of those tends to be small and the other large, then look at moving just the large one into its own table. The reasons for this will be obvious later on in this post.

> The problem comes when I try to do a query without using a where clause
> because by then, it completely discards the indexes and does a complete
> table scan which takes over half an hour! (40.710.725 rows, 1110258
> pages, 1715 seconds)

Yes it does, and it should. Why? Visibility. This has been discussed quite a bit on the lists. Because of the particular design of PostgreSQL's MVCC implementation, indexes cannot contain visibility information. Therefore, every time the db looks in an index, it then has to look in the table anyway to find the right version of that tuple and to see if it's actually valid for your snapshot.

> Can anyone tell me if this is normal behaviour (half an hour seems over
> the top to me) and if not, what I can do about it.

Yes, this is normal behaviour. It's just how PostgreSQL works. There are some workarounds out there that involve updating extra tables that carry things like counts, etc. Each of these costs something in overhead.

There are two distinct problems here. One is that you're trying to use PostgreSQL in a role where perhaps a different database might be a better choice. MSSQL Server or DB2 or even MySQL might be a better choice depending on what you want to do with your data.

The other problem is that you're using an NFS server. Either go whole hog and buy a SAN with dual 2G NICs in it, or put local storage underneath your machine with LOTS of hard drives in RAID-10.

Note that while other databases may be better at some of the queries you're trying to run, it might be that PostgreSQL is still a good choice because of other queries, and you can do rollups of the data that it's slow at while using it for the things it is good at.

I've got a test db on my workstation that's pretty big at 42,463,248 rows, taking up 12 Gigs just for the table and 7.7 Gigs in indexes, and a select count(*) on it takes 489 seconds. I try not to do things like that. It covers the last 9 months of statistics. This query:

    select a, b, count(*) from summary
    where atime > '2006-06-16' and perspective = 'yada'
    group by a, b order by a, b

took 300 seconds, which is typical. This is on a workstation with one CPU, 2 gigs of ram, and a 150 Gig SATA drive. It's running X Windows, with Evolution, Firefox, and a dozen other user apps up and running. Our "real" server, with 4 disks in a RAID 5 on a mediocre RAID controller but with 2 CPUs and 6 gigs of ram, stomps my little workstation into the ground.

I have the feeling my laptop with 512 Meg of ram and a 1.6 GHz CPU would be faster than your current server.
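To make the table-splitting idea above concrete, here is a rough sketch of what it could look like. The column names come from the quoted schema; the surrogate keys, table names and the view are invented for illustration, and (as Scott notes) the view would need rules or triggers to be writable:

    -- Side tables hold the wide, repetitive text values once each.
    CREATE TABLE qtrees (qtree_id serial PRIMARY KEY, qtree varchar(255) UNIQUE);
    CREATE TABLE files  (file_id  serial PRIMARY KEY, file  varchar(512) UNIQUE);

    -- The main stats table stays narrow: one varchar(30), two ints, three timestamps.
    CREATE TABLE stats (
        volume   varchar(30),
        qtree_id integer REFERENCES qtrees,
        file_id  integer REFERENCES files,
        ctime    timestamp,
        mtime    timestamp,
        atime    timestamp
    );
    CREATE INDEX stats_volume_idx ON stats (volume);
    CREATE INDEX stats_qtree_idx  ON stats (qtree_id);

    -- A view restores the original shape for read queries.
    CREATE VIEW stats_v AS
        SELECT s.volume, q.qtree, f.file, s.ctime, s.mtime, s.atime
          FROM stats s
          JOIN qtrees q USING (qtree_id)
          JOIN files  f USING (file_id);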
Re: [PERFORM] index structure for 114-dimension vector
On Apr 20, 12:07 pm, [EMAIL PROTECTED] (Andrew Lazarus) wrote:
> I have a table with 2.5 million real[] arrays. (They are points in a
> time series.) Given a new array X, I'd like to find, say, the 25
> closest to X in some sense -- for simplification, let's just say in the
> usual vector norm. Speed is critical here, and everything I have tried
> has been too slow.
>
> I imported the cube contrib package, and I tried creating an index on
> a cube of the last 6 elements, which are the most important. Then I
> tested the 2.5MM rows for being contained within a tolerance of the
> last 6 elements of X, +/- 0.1 in each coordinate, figuring that would
> be an indexed search (which I CLUSTERed on). I then ran the sort on
> this smaller set. The index was used, but it was still too slow. I
> also tried creating new columns with rounded int2 values of the last 6
> coordinates and made a multicolumn index.
>
> For each X the search is taking about 4-15 seconds, which is above my
> target by at least one order of magnitude. Absolute numbers are dependent
> on my hardware and settings, and some of this can be addressed with
> configuration tweaks, etc., but first I think I need to know the
> optimum data structure/indexing strategy.
>
> Is anyone on the list experienced with this sort of issue?
>
> Thanks.
> Andrew Lazarus   [EMAIL PROTECTED]

Having worked in high-dimensional spaces a lot in my career, I think you'll find that there are mathematical limits in terms of speed. In practical terms, a seq scan will be unavoidable, since to a first approximation you are limited to doing an exhaustive search in 101-dimensional space unless you make approximations or dimensionality reductions of some kind.

Read up on the Curse of Dimensionality: http://en.wikipedia.org/wiki/Curse_of_dimensionality

Have you considered dimension reduction techniques such as Singular Value Decomposition, Principal Components Analysis, etc.?
Re: [PERFORM] postgres: 100% CPU utilization
Thanks for this reply, Ron. This is almost what I was looking for.

While the upgrade to the latest version is out of the question (which unfortunately for me became the subject of this discussion), I was still looking for ways to improve the performance of the 7.0.2 version. Extensive use of vacuum was almost obvious, though I was hoping to get some more tips from postgres gurus (or dinosaurs, if you want).

Anyway, 8.2.4 was not performing so well without auto-vacuum. It ramped up to 50% of CPU usage in 2 hours under the load. With auto-vacuum ON I've got what I really need, and thus I know what to do next.

Just for the record, the hardware that was used for the test has the following parameters:
AMD Opteron 2GHz
2GB RAM
LSI Logic SCSI

Thanks everyone for your assistance!
Sergey

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ron
Sent: Monday, April 23, 2007 11:07 AM
To: Mario Weilguni
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] postgres: 100% CPU utilization

At 04:53 AM 4/23/2007, Mario Weilguni wrote:
> On Thursday, 19 April 2007, Sergey Tsukinovsky wrote:
>> 2. What would be the recommended set of parameters to tune up in order
>> to improve the performance over time, instead of considering an
>> option to vacuum every 30 minutes or so?
>>
>> 3. Is it safe to run 'vacuum' as frequently as every 15-30 minutes?
> No problem.
>
>> 4. Suggestions?
> Do yourself a favor and upgrade at least to 8.1.x and use autovacuum.

In fact, I'll go one step further and say that pg improves so much from release to release that everyone should make superhuman efforts to always be running the latest stable release. Even the differences between 8.1.x and 8.2.x are worth it. (And the fewer and more modern the releases "out in the wild", the easier community support is.)

Cheers,
Ron Peacetree
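For anyone following along on 8.1/8.2, enabling autovacuum is a postgresql.conf change plus a check that the row-level statistics collector is on; the setting names in the comments below are from memory of the 8.x documentation, so verify them against your release:

    -- postgresql.conf (8.1/8.2):
    --   stats_start_collector = on
    --   stats_row_level = on
    --   autovacuum = on
    -- then restart and confirm from psql:
    SHOW autovacuum;

    -- A manual catch-up pass is still useful after bulk changes or a long
    -- period without vacuuming:
    VACUUM ANALYZE;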
[PERFORM] Simple query, 10 million records...MySQL ten times faster
I have this table:

CREATE TABLE test_zip_assoc (
    id serial NOT NULL,
    f_id integer DEFAULT 0 NOT NULL,
    lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
    long_radians numeric(6,5) DEFAULT 0.0 NOT NULL
);
CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
CREATE INDEX long_radians ON test_zip_assoc USING btree (long_radians);

It's basically a table that associates some foreign key (for an event, for instance) with a particular location using longitude and latitude. I'm basically doing a simple proximity search. I have populated the database with *10 million* records. I then test performance by picking 50 zip codes at random and finding the records within 50 miles with a query like this:

SELECT id
FROM test_zip_assoc
WHERE
    lat_radians > 0.69014816041
    AND lat_radians < 0.71538026567
    AND long_radians > -1.35446228028
    AND long_radians < -1.32923017502

On my development server (dual proc/dual core Opteron 2.8 GHz with 4GB ram) this query averages 1.5 seconds in MySQL each time it runs, after a brief warmup period. In PostgreSQL it averages about 15 seconds.

Both of those times are too slow. I need the query to run in under a second with as many as a billion records. I don't know if this is possible, but I'm really hoping someone can help me restructure my indexes (multicolumn? multiple indexes with a 'where' clause?) so that I can get this running as fast as possible.

If I need to consider some non-database data structure in RAM I will do that too. Any help or tips would be greatly appreciated. I'm willing to go to great lengths to test this if someone can make a good suggestion that sounds like it has a reasonable chance of improving the speed of this search. There's an extensive thread on my efforts already here:

http://phpbuilder.com/board/showthread.php?t=10331619&page=10
Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster
On 24 Apr 2007 14:26:46 -0700, zardozrocks <[EMAIL PROTECTED]> wrote:
> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.0 NOT NULL
> );
> [snip]
> On my development server (dual proc/dual core Opteron 2.8 GHz with 4GB
> ram) this query averages 1.5 seconds each time it runs after a brief
> warmup period. In PostgreSQL it averages about 15 seconds.

You can always go the earthdistance route. The index takes longer to build (like 5x longer than btree), but it will optimize that exact operation.

merlin
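For reference, the earthdistance route Merlin mentions looks roughly like this once the cube and earthdistance contrib modules are loaded. This is a sketch of the documented pattern, not something tested against the posted schema: the degrees() calls are needed because the table stores radians, the radius converts 50 miles to meters, and earth_box is a bounding-box filter that exact callers usually combine with an earth_distance() check:

    -- Functional GiST index over the stored position, converted to degrees.
    CREATE INDEX test_zip_assoc_earth_idx ON test_zip_assoc
        USING gist (ll_to_earth(degrees(lat_radians), degrees(long_radians)));

    -- Everything within roughly 50 miles (80467 m) of a given lat/lon.
    SELECT id
      FROM test_zip_assoc
     WHERE earth_box(ll_to_earth(39.54, -77.60), 50 * 1609.344)
           @> ll_to_earth(degrees(lat_radians), degrees(long_radians));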
Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster
In response to zardozrocks <[EMAIL PROTECTED]>:
> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.0 NOT NULL
> );
> [snip]
> Both of those times are too slow. I need the query to run in under a
> second with as many as a billion records.
> [snip]
> There's an extensive thread on my efforts already here:
>
> http://phpbuilder.com/board/showthread.php?t=10331619&page=10

Why didn't you investigate/respond to the last posts there? The advice to bump shared_buffers is good advice. work_mem might also need bumping. Figure out which postgresql.conf your system is using and get it dialed in for your hardware. You can make all the indexes you want, but if you've told Postgres that it only has 8M of RAM to work with, performance is going to suck. I don't see hardware specs on that thread (but I didn't read the whole thing).

If the system you're using is a dedicated DB system, set shared_buffers to 1/3 - 1/2 of the physical RAM on the machine for starters.

--
Bill Moran, Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/   [EMAIL PROTECTED]   Phone: 412-422-3463x4023
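Before changing anything it is worth seeing what the server is actually running with; the values in the comments are only illustrative starting points for a dedicated 4GB machine like the one described, not settings endorsed by this thread:

    SHOW shared_buffers;   -- how much memory is the server currently allowed?
    SHOW work_mem;

    -- Illustrative postgresql.conf starting points (8.2 accepts unit suffixes;
    -- older releases take shared_buffers in 8kB pages). Changing shared_buffers
    -- needs a restart.
    --   shared_buffers = 1536MB   -- roughly 1/3 of a 4GB box, per the advice above
    --   work_mem = 16MB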
Re: [PERFORM] [GENERAL] PostgreSQL Performance Tuning
Shohab Abdullah wrote:
> We are facing a performance tuning problem while using a PostgreSQL database
> over the network on a Linux OS. Our database consists of more than 500 tables
> with an average of 10K records per table, and an average of 20 users accessing
> the database simultaneously over the network. Each table has indexes and we
> are querying the database using Hibernate. [snip]
> Please provide me with tuning advice for the database. I am attaching my
> postgresql.conf file for reference to our current configuration.

Have you changed _anything_ from the defaults? The defaults are set so PG will run on as many installations as practical. They are not set for performance - that is specific to your equipment, your data, and how you need to handle the data. Assuming the record sizes aren't huge, that's not a very large data set nor number of users.

Look at these for starters:
http://www.varlena.com/GeneralBits/Tidbits/perf.html
http://www.varlena.com/GeneralBits/Tidbits/annotated_conf_e.html

You might try setting the logging parameters to log queries longer than "x" (where x is in milliseconds - you will have to decide the appropriate value for "too long") and start looking into those first.

Make sure that you are running "analyze" if it is not being run by autovacuum.

Use EXPLAIN to see how the query is being planned - as a first pass, assume that on any reasonably sized table the words "sequential scan" mean "fix this". Note that you may have to cast variables in a query to match the variable in an index in order for the planner to figure out that it can use the index.

Read the guidelines, then take an educated stab at some settings and see how they work - other than turning off fsync, there's not much in postgresql.conf that will put your data at risk.

Cheers,
Steve
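To make those suggestions concrete, this is roughly what the slow-query logging and the plan check look like; the one-second threshold and the example table/column are invented for illustration:

    -- postgresql.conf: log every statement slower than one second.
    --   log_min_duration_statement = 1000   # in milliseconds, -1 disables

    -- Keep planner statistics fresh if autovacuum is not already doing it:
    ANALYZE;

    -- See how a slow query is planned (EXPLAIN ANALYZE also executes it):
    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

    -- On pre-8.0 servers the planner ignores an index when the compared
    -- datatypes differ, so an explicit cast to the column's type can help:
    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42::bigint;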
Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster
zardozrocks wrote:
> lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
> long_radians numeric(6,5) DEFAULT 0.0 NOT NULL

Native data types such as integer or real are much faster than numeric. If you need 6 digits, it's better to multiply your coordinates by 10^6 and store as INTEGER.

> On my development server (dual proc/dual core Opteron 2.8 GHz with 4GB
> ram) this query averages 1.5 seconds each time it runs after a brief
> warmup period. In PostgreSQL it averages about 15 seconds.

What hard drive(s) and controller(s) do you have? Please post EXPLAIN ANALYZE output of the problem query, and your postgresql.conf also.

--
Benjamin Minshall <[EMAIL PROTECTED]>   Senior Developer -- Intellicon, Inc.
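A sketch of what that change could look like against the posted schema; the new column and index names are invented for the example, and the query constants are just the original radian bounds scaled by 10^6:

    -- Radians scaled by 10^6 fit easily in a 4-byte integer.
    ALTER TABLE test_zip_assoc
        ADD COLUMN lat_r_e6  integer,
        ADD COLUMN long_r_e6 integer;

    UPDATE test_zip_assoc
       SET lat_r_e6  = round(lat_radians  * 1000000)::integer,
           long_r_e6 = round(long_radians * 1000000)::integer;

    -- One two-column btree lets the planner narrow on latitude first.
    CREATE INDEX test_zip_assoc_latlong_e6 ON test_zip_assoc (lat_r_e6, long_r_e6);

    SELECT id
      FROM test_zip_assoc
     WHERE lat_r_e6  BETWEEN   690148 AND   715380
       AND long_r_e6 BETWEEN -1354462 AND -1329230;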
Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster
NUMERIC operations are very slow in pgsql. Equality comparisons are somewhat faster, but other operations are very slow compared to other vendors' NUMERIC. We've sped it up a lot here internally, but you may want to consider using FLOAT for what you are doing.

- Luke

Msg is shrt cuz m on ma treo

-----Original Message-----
From: Bill Moran [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 26, 2007 05:13 PM Eastern Standard Time
To: zardozrocks
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster

> Why didn't you investigate/respond to the last posts there? The advice to
> bump shared_buffers is good advice. work_mem might also need bumping. [snip]
> If the system you're using is a dedicated DB system, set shared_buffers
> to 1/3 - 1/2 of the physical RAM on the machine for starters.
Re: [PERFORM] index structure for 114-dimension vector
On 4/20/07, Andrew Lazarus <[EMAIL PROTECTED]> wrote:
> I have a table with 2.5 million real[] arrays. (They are points in a
> time series.) Given a new array X, I'd like to find, say, the 25
> closest to X in some sense -- for simplification, let's just say in the
> usual vector norm. Speed is critical here, and everything I have tried
> has been too slow.

Let me chime in with the observation that this is a multidimensional nearest-neighbour problem (reverse nearest neighbour and its close cousin, k-NN) that is well known in statistics, and particularly relevant to statistical learning and classification. Knowing the jargon might help you dig up efficient algorithms to mine your data; there are tons of fascinating papers available through Citeseer.

In particular, I recommend the paper "Efficient k-NN Search on Vertically Decomposed Data" by de Vries et al, SIGMOD 2002 (PDF here: http://citeseer.ist.psu.edu/618138.html), if only for inspiration. It proposes an algorithm called BOND to drastically reduce the search space by probabilistic means. They give an example using image histograms, but the algorithm applies to any multidimensional data. Briefly put, it points out that proximity comparison can be computed vertically, a few dimensions at a time, and entire subsets can be thrown away when it's apparent that they are below a statistically derived lower bound. The only gotcha is that the algorithm derives much of its performance from the assumption that your data is vertically decomposed, one table per dimension; otherwise the search effectively incurs a sequential scan of the entire dataset, and then you're pretty much back to square one.

The most common approach to nearest-neighbour search is to use a spatial data structure. The classic algorithm is the kd-tree (http://en.wikipedia.org/wiki/Kd-tree), and there's the newer K-D-B tree, neither of which is available in PostgreSQL. If I remember correctly, R-trees have also been shown to be useful for high numbers of dimensions; with PostgreSQL you have R-trees and even better R-tree-equivalent support through GiST. I have no idea whether you can actually munge your integer vectors into something GiST can index and search, but it's a thought. (GiST, presumably, can also theoretically index kd-trees and other spatial trees.)

Alexander.
Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster
On Tue, 2007-04-24 at 16:26, zardozrocks wrote:
> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.0 NOT NULL
> );

Like someone else mentioned, numeric types are SLOW. See if you can use integers, or at least floats.

I also wonder if you might be better served with geometric types and GiST indexes on them than with your lat / long grid. With a geometric type, you could define the lat / long as a point and use geometric operations with it. See the pgsql manual:

http://www.postgresql.org/docs/8.1/static/datatype-geometric.html
http://www.postgresql.org/docs/8.1/static/functions-geometry.html

> It's basically a table that associates some foreign_key (for an event,
> for instance) with a particular location using longitude and
> latitude. I'm basically doing a simple proximity search. I have
> populated the database with *10 million* records. I then test
> performance by picking 50 zip codes at random and finding the records
> within 50 miles with a query like this:

I assume that there aren't 10 million zip codes, right? Are you storing the lat / long of the individual venues, or of the zip codes? If you're storing the lat / long of the zips, then I can't imagine there are 10 million zip codes. If you could use the lat / long numbers to find the zip codes that are in your range, then join that to a venue table that fks off of the zip code table, I would think it would be much faster, as you'd have a smaller data set to trundle through.

> SELECT id
> FROM test_zip_assoc
> WHERE
>     lat_radians > 0.69014816041
>     AND lat_radians < 0.71538026567
>     AND long_radians > -1.35446228028
>     AND long_radians < -1.32923017502
>
> On my development server (dual proc/dual core Opteron 2.8 GHz with 4GB
> ram) this query averages 1.5 seconds each time it runs after a brief
> warmup period. In PostgreSQL it averages about 15 seconds.

I wonder how well it would run if you had 10, 20, 30, 40 etc... users running it at the same time. My guess is that you'll be very lucky to get anything close to linear scaling in any database. That's because this is CPU / memory-bandwidth intensive, so it's gonna kill your DB. OTOH, if it was I/O bound you could throw more hardware at it (battery-backed cache RAID controller, etc.).

> Both of those times are too slow. I need the query to run in under a
> second with as many as a billion records. I don't know if this is
> possible but I'm really hoping someone can help me restructure my
> indexes (multicolumn?, multiple indexes with a 'where' clause?) so
> that I can get this running as fast as possible.

You're trying to do a whole lot of processing in a little time. You're either gonna have to accept a less exact answer (i.e. base it on zip codes) or come up with some way of mining the data for the answers ahead of time, kind of like a full text search for lat and long.

So, have you tried what I suggested about increasing shared_buffers and work_mem yet?
Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster
zardozrocks wrote:
> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.0 NOT NULL
> );
> CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
> CREATE INDEX long_radians ON test_zip_assoc USING btree (long_radians);

Maybe I'm missing something, but wouldn't it be easier to just use PostGIS? Or, failing that, use the vanilla built-in point type and an r-tree index? That's what r-tree indexes are made for.

--
Jeff Hoffmann   Head Plate Spinner   PropertyKey.com
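For completeness, the plain built-in route Jeff describes could look something like the sketch below. The extra column and index names are made up, the box corners reuse the radian bounds from the original query, and the degenerate box(loc, loc) expression is indexed with the stock GiST box opclass (the successor of the old rtree access method):

    -- Store each location as a single point (x = longitude, y = latitude).
    ALTER TABLE test_zip_assoc ADD COLUMN loc point;
    UPDATE test_zip_assoc SET loc = point(long_radians, lat_radians);

    -- Index a zero-size box built from the point.
    CREATE INDEX test_zip_assoc_loc_idx ON test_zip_assoc
        USING gist (box(loc, loc));

    -- Bounding-box search: overlap with the query rectangle.
    SELECT id
      FROM test_zip_assoc
     WHERE box(loc, loc) && box(point(-1.35446228028, 0.69014816041),
                                point(-1.32923017502, 0.71538026567));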
Re: [PERFORM] postgres: 100% CPU utilization
On Tue, 2007-04-24 at 10:30, Sergey Tsukinovsky wrote:
> Thanks for this reply, Ron. This is almost what I was looking for.
>
> While the upgrade to the latest version is out of the question (which
> unfortunately for me became the subject of this discussion), I was still
> looking for ways to improve the performance of the 7.0.2 version.
>
> Extensive use of vacuum was almost obvious, though I was hoping to get
> some more tips from postgres gurus (or dinosaurs, if you want).
>
> Anyway, 8.2.4 was not performing so well without auto-vacuum. It ramped
> up to 50% of CPU usage in 2 hours under the load. With auto-vacuum ON
> I've got what I really need, and thus I know what to do next.

Could you give us a better picture of how you were testing 8.2.4? My guess is that you were doing something that seemed right to you, but was working against yourself, like constant vacuum fulls and getting index bloat, or something else.

Why were you trying not to use autovacuum, btw? I've found it to be quite capable, with only a few situations (high-speed queueing) where I needed to manually schedule vacuums. And I've never seen a situation since about 7.4 where regular full vacuums were required.

> Just for the record - the hardware that was used for the test has the
> following parameters:
> AMD Opteron 2GHz
> 2GB RAM
> LSI Logic SCSI

Nice hardware. I'd really like to hear the logic behind your statement that upgrading to 8.1 or 8.2 is out of the question.
Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster
On 24 Apr 2007 14:26:46 -0700, zardozrocks <[EMAIL PROTECTED]> wrote:
> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.0 NOT NULL
> );
> CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
> CREATE INDEX long_radians ON test_zip_assoc USING btree (long_radians);

This is a spatial search -- B-tree indexes are much less efficient at this than certain other data structures. The R-tree and its many variants are based on subdividing the space into regions, allowing you to do efficient checks on containment, intersection, etc., based on points or bounding boxes.

PostgreSQL implements R-trees natively as well as through a mechanism called GiST, a framework for implementing pluggable tree-like indexes. It also provides some geometric data types. However, as far as I know, PostgreSQL's R-tree/GiST indexes do *not* provide the operators to do bounding box searches. For this you need PostGIS.

PostGIS implements the whole GIS stack, and it's so good at this that it's practically the de facto tool among GIS analysts. Installing PostGIS into a database is simple, and once you have done this, you can augment your table with a geometry (*):

    alter table test_zip_assoc add column lonlat geometry;
    update test_zip_assoc set lonlat = makepoint(
        long_radians / (3.14159265358979 / 180),
        lat_radians / (3.14159265358979 / 180));

The division is to convert your radians into degrees; PostGIS works with degrees, at least out of the box.

Now you can query on a bounding box (although, are you sure you got your lons and lats in order? That's Antarctica, isn't it?):

    select * from test_zip_assoc
    where lonlat && makebox2d(
        makepoint(-77.6049721697096, 39.5425768302107),
        makepoint(-76.1592790300818, 40.9882699698386))

This is bound to be blazingly fast. Next you can order by geographic distance if you like:

    order by distance_sphere(lonlat,
        makepoint(-77.6049721697096, 39.5425768302107))

Nobody has mentioned PostGIS so far, so I hope I'm not missing some crucial detail, like "no spatial indexes allowed!".

(*) I cheated here. The PostGIS manual recommends that you use a function to create the geometry column, because it will set up some auxiliary data structures for you that are needed for certain operations. The recommended syntax is:

    select AddGeometryColumn('', 'test_zip_assoc', 'geom', -1, 'POINT', 2);

Alexander.
Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster
On 4/27/07, Alexander Staubo <[EMAIL PROTECTED]> wrote:
[snip]
> PostGIS implements the whole GIS stack, and it's so good at this that
> it's practically the de facto tool among GIS analysts. Installing
> PostGIS into a database is simple, and once you have done this, you
> can augment your table with a geometry (*):
>
> alter table test_zip_assoc add column lonlat geometry;

I forgot to include the crucial step, of course:

    create index test_zip_assoc_lonlat_index on test_zip_assoc
        using gist (lonlat gist_geometry_ops);
    analyze test_zip_assoc;

This creates a GiST index on the geometry and (significantly) updates the table statistics.

Alexander.
Re: [PERFORM] postgres: 100% CPU utilization
Sergey Tsukinovsky wrote:
> Just for the record - the hardware that was used for the test has the
> following parameters:
> AMD Opteron 2GHz
> 2GB RAM
> LSI Logic SCSI

And you ran FreeBSD 4.4 on it, right? This may be a source of high CPU utilization in itself if the box is SMP or dual core, as multi-CPU support was pretty primitive in that release (4.11 would be better if you are required to stick to the 4.x branch; if not, then 6.2 is recommended)!

Cheers
Mark
Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster
Is there a reason you are not using PostGIS? Its R-tree indexes are designed for exactly this type of query and should be able to do it very quickly. Hope that helps, Joe > I have this table: > > CREATE TABLE test_zip_assoc ( > id serial NOT NULL, > f_id integer DEFAULT 0 NOT NULL, > lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL, > long_radians numeric(6,5) DEFAULT 0.0 NOT NULL > ); > CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians); > CREATE INDEX long_radians ON test_zip_assoc USING btree > (long_radians); > > > > It's basically a table that associates some foreign_key (for an event, > for instance) with a particular location using longitude and > latitude. I'm basically doing a simple proximity search. I have > populated the database with *10 million* records. I then test > performance by picking 50 zip codes at random and finding the records > within 50 miles with a query like this: > > SELECT id > FROM test_zip_assoc > WHERE > lat_radians > 0.69014816041 > AND lat_radians < 0.71538026567 > AND long_radians > -1.35446228028 > AND long_radians < -1.32923017502 > > > On my development server (dual proc/dual core Opteron 2.8 GHz with 4GB > ram) this query averages 1.5 seconds each time it runs after a brief > warmup period. In PostgreSQL it averages about 15 seconds. > > Both of those times are too slow. I need the query to run in under a > second with as many as a billion records. I don't know if this is > possible but I'm really hoping someone can help me restructure my > indexes (multicolumn?, multiple indexes with a 'where' clause?) so > that I can get this running as fast as possible. > > If I need to consider some non-database data structure in RAM I will > do that too. Any help or tips would be greatly appreciated. I'm > willing to go to great lengths to test this if someone can make a > good suggestion that sounds like it has a reasonable chance of > improving the speed of this search. There's an extensive thread on my > efforts already here: > > http://phpbuilder.com/board/showthread.php?t=10331619&page=10 > > > ---(end of broadcast)--- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match > ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
Steve Crawford wrote: Have you changed _anything_ from the defaults? The defaults are set so PG will run on as many installations as practical. They are not set for performance - that is specific to your equipment, your data, and how you need to handle the data. Is this really the sensible thing to do? I know we should not encourage the world we're living in to go even further in the direction of "have the computer do everything for us so that we don't need to have even a clue about what we're doing" ... But wouldn't it make sense for the configure script to determine the amount of physical memory, and perhaps even do an HD speed estimate, to set up defaults that are closer to a performance-optimized configuration? Then, perhaps, command-line switches so that you could specify the type of access you estimate for your system. Perhaps something like: ./configure --db-size=100GB --write-percentage=20 etc. (The write-percentage switch above indicates that we estimate that 20% of the DB activity would be writing to the disk --- there may be other switches to indicate the percentage of queries that are transactions, the percentage of queries that are complex, the percentage that require index usage, etc. etc. etc.)... And then, based on that, a better set of defaults could be set by the configuration script. Does this make sense? Or perhaps I'm watching too much science fiction? Carlos -- ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
Carlos Moreno <[EMAIL PROTECTED]> writes: > ... But, wouldn't it make sense that the configure script > determines the amount of physical memory and perhaps even do a HD > speed estimate to set up defaults that are closer to a > performance-optimized > configuration? No. Most copies of Postgres these days are executed on machines very far away from where the code was built. It's a little bit safer to try to tune things at initdb time ... as indeed we already do. But the fundamental problem remains that we don't know that much about how the installation will be used. For example, the planner configuration parameters turn out to have not that much to do with the absolute speed of your drive, and a whole lot to do with the ratio of the size of your database to the amount of RAM you've got; and the ultimate size of the DB is one thing initdb certainly can't guess. Also, there is an extremely good reason why Postgres will never be set up to try to take over the whole machine by default: most of the developers run multiple postmasters on their machines. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
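Tom's point about the ratio of database size to RAM is the kind of thing only the administrator can supply, which is why these planner knobs stay manual. As a purely illustrative sketch (the values are assumptions for a machine with a few GB of RAM, not recommendations), the relevant settings can be tried per session before being written into postgresql.conf:

-- effective_cache_size is expressed in 8 kB pages; 131072 pages is roughly 1 GB
SET effective_cache_size = 131072;  -- about how much of the database the OS can cache
SET random_page_cost = 3;           -- default is 4; lower it when most data stays cached
SHOW shared_buffers;                -- the other commonly-raised default worth checking

Re-running EXPLAIN ANALYZE on a problem query after changing these is the usual way to see whether the planner's choices improve.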
[PERFORM] Usage up to 50% CPU
Hi, I have pg 8.1.4 running on Windows XP Pro with a Pentium D, and I notice that I cannot use more than 50% of the CPU (the Pentium D has 2 cores). How can I change the settings to use 100% of it? Regards, Andrew Retzlaff ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] index structure for 114-dimension vector
On Fri, 27 Apr 2007, Alexander Staubo wrote: On 4/20/07, Andrew Lazarus <[EMAIL PROTECTED]> wrote: I have a table with 2.5 million real[] arrays. (They are points in a time series.) Given a new array X, I'd like to find, say, the 25 closest to X in some sense--for simplification, let's just say in the usual vector norm. Speed is critical here, and everything I have tried has been too slow. Let me chime in with the observation that this is a multidimensional nearest-neighbour problem (reverse nearest neighbour and its close cousin, k-NN) that is well known in statistics, and particularly relevant to statistical learning and classification. Knowing the jargon might help you dig up efficient algorithms to mine your data; there are tons of fascinating papers available through Citeseer. In particular, I recommend the paper "Efficient k-NN Search on Vertically Decomposed Data" by de Vries et al., SIGMOD 2002 (PDF here: http://citeseer.ist.psu.edu/618138.html), if only for inspiration. It proposes an algorithm called BOND to drastically reduce the search space by probabilistic means. They give an example using image histograms, but the algorithm applies to any multidimensional data. Briefly put, it points out that proximity comparison can be computed vertically, a few dimensions at a time, and entire subsets can be thrown away when it's apparent that they are below a statistically derived lower bound. The only gotcha is that the algorithm derives much of its performance from the assumption that your data is vertically decomposed, one table per dimension; otherwise the search effectively incurs a sequential scan of the entire dataset, and then you're pretty much back to square one. The most common approach to nearest-neighbour search is to use a spatial data structure. The classic algorithm is the kd-tree (http://en.wikipedia.org/wiki/Kd-tree) and there's the newer K-D-B tree, neither of which are available in PostgreSQL. If I remember correctly, R-trees have also been shown to be useful for high numbers of dimensions; with PostgreSQL you have R-trees and even better R-tree-equivalent support through GiST. I have no idea whether you can actually munge your integer vectors into something GiST can index and search, but it's a thought. (GiST, presumably, can also theoretically index kd-trees and other spatial trees.) You're right, but currently only in theory, due to an interface restriction. We have plans to improve it sometime. There was the SP-GiST project, which could be used for a K-D-B tree; see http://www.cs.purdue.edu/spgist/ I don't know if it works with version 8.2. Also, it doesn't support concurrency and recovery. Alexander. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
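To make the "computed vertically, a few dimensions at a time" idea concrete, here is a loose SQL sketch of the flavour of the approach; the table names and the threshold are hypothetical, not taken from the paper. Because every term of the squared distance is non-negative, a partial sum over any subset of dimensions is already a lower bound on the full squared distance, so candidates whose partial sum exceeds the distance of the current 25th-best match can be dropped without reading their remaining dimensions:

SELECT d.vec_id,
       sum((d.val - q.val)^2) AS partial_dist2  -- lower bound on the full squared distance
FROM dims_1_to_10 d                             -- hypothetical block: (vec_id, dim, val) for the first 10 of 114 dims
JOIN qvec q USING (dim)                         -- the query vector, stored in the same layout
GROUP BY d.vec_id
HAVING sum((d.val - q.val)^2) < 250.0;          -- 250.0 = hypothetical pruning threshold (current 25th-best distance)

Survivors of this step would then be refined against the next block of dimensions, which is where the pruning pays off.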
Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster
Folks, we in astronomy routinely work with billions of objects with spherical attributes and have several sky-indexing schemes. See my page for links: http://www.sai.msu.su/~megera/wiki/SkyPixelization We have the q3c package for PostgreSQL, available from q3c.sf.net, which we use in production with terabyte-sized databases. Oleg On Thu, 26 Apr 2007, Scott Marlowe wrote: On Tue, 2007-04-24 at 16:26, zardozrocks wrote: I have this table: CREATE TABLE test_zip_assoc ( id serial NOT NULL, f_id integer DEFAULT 0 NOT NULL, lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL, long_radians numeric(6,5) DEFAULT 0.0 NOT NULL ); Like someone else mentioned, numeric types are SLOW. See if you can use integers, or at least floats. I also wonder if you might be better served with geometric types and GiST indexes on them than using your lat / long grid. With a geometric type, you could define the lat / long as a point and use geometric operations with it. See the pgsql manual: http://www.postgresql.org/docs/8.1/static/datatype-geometric.html http://www.postgresql.org/docs/8.1/static/functions-geometry.html It's basically a table that associates some foreign_key (for an event, for instance) with a particular location using longitude and latitude. I'm basically doing a simple proximity search. I have populated the database with *10 million* records. I then test performance by picking 50 zip codes at random and finding the records within 50 miles with a query like this: I assume that there aren't 10 million zip codes, right? Are you storing the lat / long of the individual venues? Or the zip codes? If you're storing the lat / long of the zips, then I can't imagine there are 10 million zip codes. If you could use the lat / long numbers to find the zip codes that are in your range, then join that to a venue table that fks off of the zip code table, I would think it would be much faster, as you'd have a smaller data set to trundle through. SELECT id FROM test_zip_assoc WHERE lat_radians > 0.69014816041 AND lat_radians < 0.71538026567 AND long_radians > -1.35446228028 AND long_radians < -1.32923017502 On my development server (dual proc/dual core Opteron 2.8 GHz with 4GB ram) this query averages 1.5 seconds each time it runs after a brief warmup period. In PostgreSQL it averages about 15 seconds. I wonder how well it would run if you had 10, 20, 30, 40 etc... users running it at the same time. My guess is that you'll be very lucky to get anything close to linear scaling in any database. That's because this is CPU / memory bandwidth intensive, so it's gonna kill your DB. OTOH, if it was I/O bound you could throw more hardware at it (battery-backed cache RAID controller, etc.) Both of those times are too slow. I need the query to run in under a second with as many as a billion records. I don't know if this is possible but I'm really hoping someone can help me restructure my indexes (multicolumn?, multiple indexes with a 'where' clause?) so that I can get this running as fast as possible. You're trying to do a whole lot of processing in a little time. You're either gonna have to accept a less exact answer (i.e. base it on zip codes) or come up with some way of mining the data for the answers ahead of time, kind of like a full text search for lat and long. So, have you tried what I suggested about increasing shared_buffers and work_mem yet? 
---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
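For reference, a minimal sketch of Scott's first suggestion above; the _f-suffixed table name is made up for illustration, and the single two-column index is an additional idea, not something from the thread. Swapping the numeric columns for plain floats makes the comparisons much cheaper, and a composite index lets the latitude band be resolved in one index scan:

CREATE TABLE test_zip_assoc_f (
    id           serial  PRIMARY KEY,
    f_id         integer NOT NULL DEFAULT 0,
    lat_radians  float8  NOT NULL DEFAULT 0.0,
    long_radians float8  NOT NULL DEFAULT 0.0
);

CREATE INDEX test_zip_assoc_f_lat_long
    ON test_zip_assoc_f (lat_radians, long_radians);

The range query itself can stay exactly as written in the original post; only the data types and the index change.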
Re: [PERFORM] Fragmentation of WAL files
On Thu, 26 Apr 2007, Bill Moran wrote: I've seen marketing material that claims that modern NTFS doesn't suffer performance problems from fragmentation. You're only reading half of the marketing material then. For a balanced picture, read the stuff generated by the companies that sell defragmenting tools. A good one to start with is http://files.diskeeper.com/pdf/HowFileFragmentationOccursonWindowsXP.pdf Going back to Jim's original question, they suggest a Microsoft paper that talks about how the defrag report can be misleading with respect to open files. See http://support.microsoft.com/kb/228198 Also, some of the most interesting details they gloss over are specific to which version of Windows you're using; the reference guide to the subject of how NTFS decides how much space to pre-allocate at a time is available at http://support.microsoft.com/kb/841551 (ZIP file wrapped into EXE, yuck!) If you compare them, you can see that the approach they're using in NTFS has evolved to become more like that used by a good UNIX filesystem over time. I think your typical UNIX still has a healthy lead in this area, but the difference isn't as big as it used to be. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] index structure for 114-dimension vector
On 21-4-2007 1:42 Mark Kirkwood wrote: I don't think that will work for the vector norm i.e: |x - y| = sqrt(sum over j ((x[j] - y[j])^2)) I don't know if this is useful here, but I was able to rewrite that algorithm for a set of very sparse vectors (i.e. they had very few overlapping factors) to something like: |x - y|^2 = sum over j (x[j]^2) + sum over j (y[j]^2) + for each j where x[j] and y[j] are both non-zero: - (x[j]^2 + y[j]^2) + (x[j] - y[j])^2 (Working with the squared distance doesn't change which matches are closest.) The first two sums need to be calculated only once. So if you have very little overlap, this is much more efficient (if there is no overlap at all you end up with just those two sums anyway). Besides, this rewritten calculation allows you to store the X and Y vectors using a trivial table layout vector(x,i,value) which is only filled with non-zeros and which you can trivially self-join to find the closest matches. You don't care about the j's where there is either no x or no y value anyway with this rewrite. I can compare over 1000 y's of on average 100 elements to two x's of over 1000 elements on just a single 1.8 GHz AMD processor. (I use it for a bi-kmeans algorithm, so there are only two buckets to compare to.) So it might be possible to rewrite your algorithm to be less calculation-intensive. Obviously, with a dense matrix this isn't going to work, but there may be other ways to circumvent parts of the algorithm or to cache large parts of it. It might also help to extract only the 6 relevant columns into a separate temporary table which will have much smaller records and thus can fit more records per page. Best regards, Arjen ---(end of broadcast)--- TIP 6: explain analyze is your friend
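A rough SQL sketch of the layout Arjen describes; all table and column names here are made up for illustration. One row per non-zero component, plus a small table of precomputed squared norms, so that only the overlapping dimensions ever need to be joined:

CREATE TABLE vec_component (
    vec_id integer NOT NULL,    -- which vector
    dim    integer NOT NULL,    -- which dimension j
    val    float8  NOT NULL,    -- x[j], stored only when non-zero
    PRIMARY KEY (vec_id, dim)
);

CREATE TABLE vec_norm (
    vec_id integer PRIMARY KEY,
    norm2  float8  NOT NULL     -- precomputed sum over j of x[j]^2
);

-- Squared distance from vector 42 (hypothetical query id) to every other vector:
-- |x - y|^2 = |x|^2 + |y|^2 - 2 * (sum over overlapping j of x[j]*y[j])
SELECT n.vec_id,
       n.norm2 + qn.norm2 - 2 * coalesce(o.dotprod, 0) AS dist2
FROM vec_norm n
JOIN vec_norm qn ON qn.vec_id = 42
LEFT JOIN (
    SELECT c.vec_id, sum(c.val * q.val) AS dotprod
    FROM vec_component c
    JOIN vec_component q ON q.dim = c.dim AND q.vec_id = 42
    GROUP BY c.vec_id
) o ON o.vec_id = n.vec_id
WHERE n.vec_id <> 42
ORDER BY dist2
LIMIT 25;

The inner join only ever touches dimensions where both vectors are non-zero, which is exactly where the cross term in the rewrite above comes from.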