Re: [pgsql-advocacy] [PERFORM] Postgres and really huge tables
On Thu, 18 Jan 2007, Tom Lane wrote:

> Brian Hurt <[EMAIL PROTECTED]> writes:
>> Is there any experience with Postgresql and really huge tables? I'm
>> talking about terabytes (plural) here in a single table.
>
> The 2MASS sky survey point-source catalog
> http://www.ipac.caltech.edu/2mass/releases/allsky/doc/sec2_2a.html
> is 470 million rows by 60 columns; I don't have it loaded up, but a very
> conservative estimate would be a quarter terabyte. (I've got a copy of
> the data ... 5 double-sided DVDs, gzipped ...) I haven't heard from Rae
> Stiening recently, but I know he's been using Postgres to whack that
> data around since about 2001 (PG 7.1 or so, which is positively medieval
> compared to current releases). So at least for static data, it's
> certainly possible to get useful results. What are your processing
> requirements?
>
> regards, tom lane

We are working in production with 2MASS and other catalogues, and 2MASS
is not the biggest. The NOMAD catalog has more than a billion records.
You can query them online: http://vo.astronet.ru/cas/conesearch.php

Everything is in PostgreSQL 8.1.5; at present we are migrating to 8.2.1,
which is very slow, because of slow COPY. The hardware we use is an HP
rx1620, dual Itanium2, MSA 20, currently 4.5 TB.

Regards,
Oleg
_____
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your message can
get through to the mailing list cleanly
Re: [PERFORM] DB benchmark and pg config file help
On 1/17/07, Kevin Hunter <[EMAIL PROTECTED]> wrote:

> Hello List,
>
> Not sure to which list I should post (gray lines, and all that), so
> point me in the right direction if'n it's a problem. I am in the
> process of learning some of the art/science of benchmarking. Given
> novnov's recent post about the comparison of MS SQL vs PostgreSQL, I
> felt it was time to do a benchmark comparison of sorts for myself ...
> more for me and the benchmark learning process than for the DBs, but
> I'm interested in DBs in general, so it's a good fit. (If I find
> anything interesting/new, I will of course share the results.)

Just remember that all the major commercial databases have
anti-benchmark clauses in their license agreements. So, if you decide to
publish your results (especially in a formal benchmark), you can't
mention the big boys by name. [yes, this is cowardice]

merlin
Re: [PERFORM] DB benchmark and pg config file help
On 19 Jan 2007 at 8:45a -0500, Merlin Moncure wrote:

> On 1/17/07, Kevin Hunter [hunteke∈earlham.edu] wrote:
>> I am in the process of learning some of the art/science of
>> benchmarking. ... (If I find anything interesting/new, I will of
>> course share the results.)
>
> Just remember that all the major commercial databases have
> anti-benchmark clauses in their license agreements. So, if you decide
> to publish your results (especially in a formal benchmark), you can't
> mention the big boys by name. [yes, this is cowardice]

Anti-benchmark clauses in the license agreements?!? Cowardice indeed!

<wry_look> So, by implication, I should do my benchmarking with borrowed
copies, right? No sale, no agreement ... </wry_look>

Seriously though, that would have bitten me. Thank you, I did not know
that. Does that mean that I can't publish the results outside of my
work/research/personal unit at all? Or do I just need to obscure which
DB I'm talking about? (Like "Vendor {1,2,3,...} Product".)

Appreciatively,
Kevin
Re: [PERFORM] Postgres and really huge tables
On 1/18/07, Brian Hurt <[EMAIL PROTECTED]> wrote:

> Is there any experience with Postgresql and really huge tables? I'm
> talking about terabytes (plural) here in a single table. Obviously the
> table will be partitioned, and probably spread among several different
> file systems. Any other tricks I should know about?

A pretty effective partitioning strategy that works in some cases is to
identify a criterion in your dataset that isolates your data on a
session basis. For example, if you have a company_id that divides up
your company data, and a session only needs to deal with one company_id,
you can separate out all your tables based on company_id into different
schemas and have the session set the search_path variable when it logs
in. Data that does not partition on your criterion sits in public
schemas that all the companies can see.

This takes advantage of a special property of stored procedures: they do
not attach to tables until the first time they are executed in a
session, which keeps you from having to make a function for each schema.
(Note: views do not have this property.) You can still cross-query using
views and the like, or hand-rolled SQL. I would call this type of
partitioning "logical partitioning", since you are leveraging logical
divisions in your data. It obviously doesn't work in all cases, but when
it does, it works great.

> We have a problem of that form here. When I asked why postgres wasn't
> being used, the opinion that postgres would just [expletive] die was
> given. Personally, I'd bet money postgres could handle the problem (and
> better than the ad-hoc solution we're currently using). But I'd like a
> couple of replies of the form "yeah, we do that here -- no problem" to
> wave around.

pg will of course not die when your dataset hits a certain threshold. It
will become slower, following well-known mathematical patterns that grow
with your working set size.
One of the few things that gets to be a pain with large tables is
vacuum -- since you can't vacuum a piece of a table, and there are
certain annoyances with having a long-running vacuum, this is something
to think about.

Speaking broadly about table partitioning: it optimizes one case at the
expense of another. Your focus (IMO) should be on reducing your working
set size under certain conditions -- not the physical file size. If you
have a properly laid out and logical dataset, and can identify special
cases where you need some information and not other information, the
partitioning strategy should fall into place, whether it is to do
nothing, isolate data into separate schemas/tables/files, or use the
built-in table partitioning feature (which, to be honest, I am not crazy
about).

merlin
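The schema-per-tenant layout described above can be sketched in a few
statements. This is an illustrative example, not from the thread: the
schema names, tables, and the order_count() function are all made up.
Note the function body references the unqualified table name, so which
schema's table it reads depends on the session's search_path:

```sql
-- Hypothetical sketch of "logical partitioning" by schema.
CREATE SCHEMA acme;
CREATE SCHEMA globex;

-- Identical per-company tables, one copy per schema:
CREATE TABLE acme.orders   (order_id int PRIMARY KEY, placed_on date);
CREATE TABLE globex.orders (order_id int PRIMARY KEY, placed_on date);

-- Shared reference data all companies can see lives in public:
CREATE TABLE public.countries (code char(2) PRIMARY KEY, name text);

-- Written once, against the unqualified table name; it resolves
-- "orders" via the search_path in effect when it runs:
CREATE FUNCTION order_count() RETURNS bigint AS
  'SELECT count(*) FROM orders' LANGUAGE sql;

-- Each session picks its slice of the data at login:
SET search_path TO acme, public;
SELECT order_count();   -- counts rows in acme.orders
```

A session that instead ran `SET search_path TO globex, public` would get
counts from globex.orders with no change to the function.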
Re: [PERFORM] DB benchmark and pg config file help
On Fri, Jan 19, 2007 at 09:05:35 -0500, Kevin Hunter <[EMAIL PROTECTED]> wrote:

> Seriously though, that would have bitten me. Thank you, I did not know
> that. Does that mean that I can't publish the results outside of my
> work/research/personal unit at all? Or do I just need to obscure which
> DB I'm talking about? (Like "Vendor {1,2,3,...} Product".)

Check with your lawyer. Depending on where you are, those clauses may
not even be valid.
Re: [PERFORM] Configuration Advice
On 1/17/07, Steve <[EMAIL PROTECTED]> wrote:

> Hey there; I've been lurking on this list awhile, and I've been working
> with postgres for a number of years, so I'm not exactly new to this.
> But I'm still having trouble getting a good balance of settings, and
> I'd like to see what other people think. We may also be willing to hire
> a contractor to help tackle this problem if anyone is interested.

I happen to be something of a COBOL-SQL expert; if you are interested in
some advice, you can contact me off-list. I converted an enterprise
COBOL (AcuCOBOL) app to PostgreSQL by plugging pg into the COBOL system
via a custom C++ ISAM driver.

> I've got an application here that runs large (in terms of length -- the
> queries have a lot of conditions in them) queries that can potentially
> return millions of rows, but on average probably return tens of
> thousands of rows. It's read-only for most of the day, and pretty much
> all the queries except one are really fast.

If it's just one query, I think I'd focus on optimizing that query, not
.conf settings. In my opinion, .conf tuning (a few gotchas aside)
doesn't really get you all that much.

> However, each night we load data from a legacy COBOL system into the
> SQL system, and then we summarize that data to make the reports faster.
> This load process is intensely insert/update driven but also has a
> hefty amount of selects as well. This load process is taking ever
> longer to complete.
>
> SO ... our goal here is to make this load process take less time. It
> seems the big part is building the big summary table; this big summary
> table is currently 9 million rows big. Every night, we drop the table,
> re-create it, build the 9 million rows of data (we use COPY to put the
> data in when it's prepared, not INSERT), and then build the indexes on
> it -- of which there are many.
> Unfortunately, this table gets queried in a lot of different ways and
> needs these indexes; also unfortunately, we have operator-class indexes
> to support both ASC and DESC sorting on columns, so these are for all
> intents and purposes duplicates, but required under Postgres 8.1
> (we've recently upgraded to Postgres 8.2 -- is this still a
> requirement?)

I have some very specific advice here: check out the row-wise comparison
feature introduced in 8.2.

> Building these indexes takes forever! It's a long grind through
> inserts, and then building the indexes takes a hefty amount of time too
> (about 9 hours). Now, the application is likely part at fault, and
> we're working to make it more efficient, but it has nothing to do with
> the index-building time. I'm wondering what we can do to make this
> better, if anything; would it be better to leave the indexes on? It
> doesn't seem to be. Would it be better to use INSERTs instead of COPY?
> Doesn't seem to be, no.

Probably any optimization strategies would focus on reducing the amount
of data you have to load.

merlin
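The row-wise comparison feature merlin mentions lets a query page
through a multi-column sort order with a single composite index, instead
of the clumsy nested OR conditions that older releases required. A
minimal sketch, with made-up table and column names (not Steve's actual
schema):

```sql
-- Hypothetical summary table with a composite index:
CREATE TABLE report (customer_id int, entry_date date, amount numeric);
CREATE INDEX report_cust_date ON report (customer_id, entry_date);

-- 8.2 evaluates this row constructor comparison with true row-wise
-- semantics and can drive it from the two-column index, fetching the
-- "next page" after the last (customer_id, entry_date) pair seen:
SELECT customer_id, entry_date, amount
FROM report
WHERE (customer_id, entry_date) > (42, DATE '2007-01-17')
ORDER BY customer_id, entry_date
LIMIT 100;
```

On 8.1, the equivalent had to be spelled out as
`customer_id > 42 OR (customer_id = 42 AND entry_date > ...)`, which the
planner handled less gracefully.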
Re: [PERFORM] DB benchmark and pg config file help
On 19 Jan 2007 at 10:56a -0600, Bruno Wolff III wrote:

> On Fri, Jan 19, 2007 at 09:05:35 -0500, Kevin Hunter [EMAIL PROTECTED] wrote:
>> Seriously though, that would have bitten me. Thank you, I did not know
>> that. Does that mean that I can't publish the results outside of my
>> work/research/personal unit at all? Or do I just need to obscure which
>> DB I'm talking about? (Like "Vendor {1,2,3,...} Product".)
>
> Check with your lawyer. Depending on where you are, those clauses may
> not even be valid.

<grins> /me = student = no money ... lawyer? You /are/ my lawyers. ;)

Well, it sounds like America's legal system/red tape will at least slow
my efforts against the non-open-source DBs, until I get a chance to find
out for sure. I really do appreciate the warnings/heads-ups.

Kevin

BTW: I'm currently located in Richmond, IN, USA. A pin for someone's
map. :)
Re: [PERFORM] [pgsql-advocacy] Postgres and really huge tables
> A lot of data, but not a lot of records... I don't know if that's
> valid. I guess the people at Greenplum and/or Sun have more exciting
> stories ;)

Not really. Multi-terabyte tables are pretty much fine on vanilla
PostgreSQL if you can stick to partitioned and/or indexed access. If you
need to do unindexed fishing expeditions on 5 TB of data, then talk to
Greenplum.

http://www.powerpostgresql.com/Downloads/terabytes_osc2005.pdf

--Josh Berkus
PostgreSQL @ Sun
San Francisco
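In the 8.1/8.2 era Josh is writing about, the "partitioned access" he
recommends was built from table inheritance plus CHECK constraints and
constraint exclusion. A minimal sketch under assumed names (the
measurements table and its columns are illustrative, not from the
thread):

```sql
-- Parent table holds no data itself; children carry monthly slices:
CREATE TABLE measurements (
    taken_on date NOT NULL,
    value    double precision
);

CREATE TABLE measurements_2007_01 (
    CHECK (taken_on >= DATE '2007-01-01' AND taken_on < DATE '2007-02-01')
) INHERITS (measurements);

CREATE TABLE measurements_2007_02 (
    CHECK (taken_on >= DATE '2007-02-01' AND taken_on < DATE '2007-03-01')
) INHERITS (measurements);

-- With constraint exclusion enabled, the planner skips any child whose
-- CHECK constraint proves it cannot match, so a date-bounded query
-- scans one month's partition rather than the whole multi-TB table:
SET constraint_exclusion = on;
SELECT count(*) FROM measurements
WHERE taken_on BETWEEN DATE '2007-01-10' AND DATE '2007-01-20';
```

Rows are routed to the right child either by the loading job inserting
directly into it, or by rules/triggers on the parent.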