Re: [GENERAL] optimizing daily data storage in Pg
On 7/23/2010 12:39 AM, P Kishor wrote:
> On Thu, Jul 22, 2010 at 4:56 PM, Andy Colson wrote:
>> On 7/22/2010 9:41 AM, P Kishor wrote:
>>> I have been struggling with this for a while now, have even gone down
>>> a few paths but struck out, so I turn now to the community for ideas.
>>> First, the problem: Store six daily variables for ~ 25 years for cells
>>> in a grid.
>>>
>>> * Number of vars = 6
>>> * Number of cells ~ 13 million
>>> * Number of days ~ 9125 (25 * 365)
>>>
>>> Optimize the store for two different kinds of queries:
>>>
>>> Query one: Retrieve the value of a single var for all or a portion of
>>> the cells for a single day. This is analogous to an image where every
>>> pixel is the value of a single var.
>>>
>>>     SELECT FROM d WHERE yr = ? AND yday = ?;
>>>     SELECT FROM d WHERE yr = ? AND yday = ? AND cell_id IN (?,?,?...);
>>>
>>> Query two: Retrieve values for all the days or a duration of days for
>>> a single var for a single cell. This is like grabbing a column out of
>>> a table in which each row holds all the vars for a single day.
>>>
>>>     SELECT FROM d WHERE cell_id = ?;
>>>     SELECT FROM d WHERE cell_id IN (?,?,?...);
>>
>> First, I must admit to not reading your entire email.
>
> I am not sure how to respond to your feedback given that you haven't
> read the entire email. Nevertheless, thanks for writing...

Heh, sorry, my point was, you put a lot of information into your email,
and I was going to use only one bit of it: row counts.

>> Second, Query 1 should be fast, regardless of how you lay out the tables.
>
> It is not fast. Right now I have data for about 250,000 cells loaded.
> That comes to circa 92 million rows per year. Performance is pretty
> sucky.

This query should return one record, correct? It should be very fast:
PG should be able to find the record in the index within 5 seeks, and
then find the data in one seek. Can you post 'explain analyze' for this
query? (Or it could be that I totally misunderstood your data.)

>> Third, Query 2 will return 13M rows? I don't think it matters how you
>> lay out the tables; returning 13M rows is always going to be slow.
>
> Yes, I understand that. In reality I will never get 13 M rows. For
> display purposes, I will probably get around 10,000 rows to 50,000
> rows. When more rows are needed, it will be to feed a model, so that
> can be offline (without an impatient human being waiting on the other
> end).
>
> Right now, my main problem is that I have either too many rows (~4 B
> rows) in a manageable number of tables (25 tables) or a manageable
> number of rows (~13 M rows) in too many tables (~9000 tables).

Searching by just cell_id is not going to be very selective, and with
large result sets I can see this one being slow. As Scott talked about
in his response, this one will come down to hardware. Have you dd
tested your hardware?

-Andy

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
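Andy's "dd tested" question refers to a crude sequential-throughput check of the disk subsystem. A minimal sketch (file path and size are illustrative; on a real test, make the file at least twice the machine's RAM so the OS page cache can't satisfy the reads):

```shell
# Sequential write: conv=fdatasync forces the data to disk before dd
# reports a rate, so the number reflects the disk, not the page cache.
dd if=/dev/zero of=/tmp/dd_test bs=1M count=64 conv=fdatasync

# Sequential read of the same file (on Linux, drop caches first with
# 'echo 3 > /proc/sys/vm/drop_caches' as root for an honest number).
dd if=/tmp/dd_test of=/dev/null bs=1M

rm -f /tmp/dd_test
```

Compare the reported MB/s against what the RAID setup should deliver; a large gap points at the IO layer rather than at Postgres.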
Re: [GENERAL] optimizing daily data storage in Pg
Does ordering the data in the table by day, cell_id help at all?

How big ARE the indexes we're talking about? If the total size of all the indexes you need to do your queries runs into the hundreds of gigabytes, going from 12 to 32 Gigs of RAM may be like flapping your arms out the window of your car in terms of effectiveness. If the indexes that you need at one time add up to something in the 32 to 256 Gig range, then adding enough memory to hold most or all of them at once would help and is doable, but the price goes up fast when you get into the > 64 Gig range.

If everything is read only, then pg may or may not be the best fit. It sounds more like what you're doing is batch processing, or at least batch processing friendly. If you could roll up your data, either in the db or beforehand while preparing it, that might be a big win.

Is your machine IO bound when running these queries? What does "iostat -xd 1" say about throughput and % utilization? I'm assuming you're running an OS with sysstat available. If you're on something else, then you'll need to research how to see your IO workload on that OS. htop is a nice graphical way of seeing your wait states as well, with the red bars representing IO wait on a machine.

If your machine is IO bound, and you've got enough ram to hold the working set of your indexes, then you'll need more hard drives under good controllers to make it faster. The bad news is that RAID controllers and lots of hard drives can be expensive; the good news is that reporting servers (which is what this sounds like) use a lot of sequential access, and throwing more drives at the problem usually gives big gains.

As far as partitioning goes, I think you either need to use fewer partitions, or just use individual tables without using the parent table to access them. It's a well known problem with partitioning that past a few hundred or so child tables things get slow pretty fast.
Having 200 to 500 tables, maybe even 1,000, is workable, but past that, no, not really. If you're not IO bound, then you'll likely need more CPU horsepower.
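Scott's "use individual tables without the parent" suggestion, applied to the per-year partitioning the original poster describes, would look something like this (a sketch; table and column names follow the thread's schema, and the specific cell_id is illustrative):

```sql
-- Hit one yearly child table directly, avoiding the parent table and
-- the planner/lock overhead of opening thousands of children:
SELECT yday, a
FROM d_1980
WHERE cell_id = 12345
ORDER BY yday;

-- Alternatively, with constraint exclusion enabled, a query against the
-- parent that filters on the CHECK column visits only matching children:
SET constraint_exclusion = partition;
SELECT a FROM d WHERE yr = 1980 AND cell_id = 12345;
```

The first form pushes partition routing into the application; the second keeps the parent-table interface but only pays for the children that the CHECK constraints cannot rule out.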
Re: [GENERAL] optimizing daily data storage in Pg
On Thu, Jul 22, 2010 at 4:56 PM, Andy Colson wrote:
> On 7/22/2010 9:41 AM, P Kishor wrote:
>> I have been struggling with this for a while now, have even gone down
>> a few paths but struck out, so I turn now to the community for ideas.
>> First, the problem: Store six daily variables for ~ 25 years for cells
>> in a grid.
>>
>> * Number of vars = 6
>> * Number of cells ~ 13 million
>> * Number of days ~ 9125 (25 * 365)
>>
>> Optimize the store for two different kinds of queries:
>>
>> Query one: Retrieve the value of a single var for all or a portion of
>> the cells for a single day. This is analogous to an image where every
>> pixel is the value of a single var.
>>
>>     SELECT FROM d WHERE yr = ? AND yday = ?;
>>     SELECT FROM d WHERE yr = ? AND yday = ? AND cell_id IN (?,?,?...);
>>
>> Query two: Retrieve values for all the days or a duration of days for
>> a single var for a single cell. This is like grabbing a column out of
>> a table in which each row holds all the vars for a single day.
>>
>>     SELECT FROM d WHERE cell_id = ?;
>>     SELECT FROM d WHERE cell_id IN (?,?,?...);
>
> First, I must admit to not reading your entire email.

I am not sure how to respond to your feedback given that you haven't
read the entire email. Nevertheless, thanks for writing...

> Second, Query 1 should be fast, regardless of how you lay out the tables.

It is not fast. Right now I have data for about 250,000 cells loaded.
That comes to circa 92 million rows per year. Performance is pretty
sucky.

> Third, Query 2 will return 13M rows? I don't think it matters how you
> lay out the tables; returning 13M rows is always going to be slow.

Yes, I understand that. In reality I will never get 13 M rows. For
display purposes, I will probably get around 10,000 rows to 50,000
rows. When more rows are needed, it will be to feed a model, so that
can be offline (without an impatient human being waiting on the other
end).

Right now, my main problem is that I have either too many rows (~4 B
rows) in a manageable number of tables (25 tables) or a manageable
number of rows (~13 M rows) in too many tables (~9000 tables).

> -Andy

--
Puneet Kishor http://www.punkish.org
Carbon Model http://carbonmodel.org
Charter Member, Open Source Geospatial Foundation http://www.osgeo.org
Science Commons Fellow, http://sciencecommons.org/about/whoweare/kishor
Nelson Institute, UW-Madison http://www.nelson.wisc.edu
---
Assertions are politics; backing up assertions with evidence is science
===
Re: [GENERAL] optimizing daily data storage in Pg
On 7/22/2010 9:41 AM, P Kishor wrote:
> I have been struggling with this for a while now, have even gone down
> a few paths but struck out, so I turn now to the community for ideas.
> First, the problem: Store six daily variables for ~ 25 years for cells
> in a grid.
>
> * Number of vars = 6
> * Number of cells ~ 13 million
> * Number of days ~ 9125 (25 * 365)
>
> Optimize the store for two different kinds of queries:
>
> Query one: Retrieve the value of a single var for all or a portion of
> the cells for a single day. This is analogous to an image where every
> pixel is the value of a single var.
>
>     SELECT FROM d WHERE yr = ? AND yday = ?;
>     SELECT FROM d WHERE yr = ? AND yday = ? AND cell_id IN (?,?,?...);
>
> Query two: Retrieve values for all the days or a duration of days for
> a single var for a single cell. This is like grabbing a column out of
> a table in which each row holds all the vars for a single day.
>
>     SELECT FROM d WHERE cell_id = ?;
>     SELECT FROM d WHERE cell_id IN (?,?,?...);

First, I must admit to not reading your entire email.

Second, Query 1 should be fast, regardless of how you lay out the
tables.

Third, Query 2 will return 13M rows? I don't think it matters how you
lay out the tables; returning 13M rows is always going to be slow.

-Andy
[GENERAL] optimizing daily data storage in Pg
I have been struggling with this for a while now, have even gone down a few paths but struck out, so I turn now to the community for ideas. First, the problem: Store six daily variables for ~ 25 years for cells in a grid.

* Number of vars = 6
* Number of cells ~ 13 million
* Number of days ~ 9125 (25 * 365)

Optimize the store for two different kinds of queries:

Query one: Retrieve the value of a single var for all or a portion of the cells for a single day. This is analogous to an image where every pixel is the value of a single var.

Query two: Retrieve values for all the days or a duration of days for a single var for a single cell. This is like grabbing a column out of a table in which each row holds all the vars for a single day.

So, I set about designing the db. The "grid" is in a table with 13 million rows

    CREATE TABLE cells (
        cell_id    INTEGER,
        other_data ..
    )
    WITH ( OIDS=FALSE )

A single table *where every row is one day's values for one cell* looks like so

    CREATE TABLE d (
        yr      SMALLINT,
        yday    SMALLINT,
        a       SMALLINT,
        b       SMALLINT,
        d       SMALLINT,
        e       SMALLINT,
        f       SMALLINT,
        g       SMALLINT,
        cell_id INTEGER
    )
    WITH ( OIDS=FALSE )

The data would look like so

    yr    yday   a b c d e f g   cell_id
    1980     1   x x x x x x x   1
    ..
    1980   365   x x x x x x x   1
    ...
    1981     1   x x x x x x x   1
    ..
    1981   365   x x x x x x x   1
    ...
    ...
    2005     1   x x x x x x x   1
    ..
    2005   365   x x x x x x x   1
    ..
    1980     1   x x x x x x x   2
    ..
    1980   365   x x x x x x x   2
    ...

I could now (theoretically) conduct my queries like so:

Query 1a: Retrieve the value of a single var for all the cells for a single day. This is analogous to an image where every pixel is the value of a single var.

    SELECT FROM d WHERE yr = ? AND yday = ?;

I am assuming I would need an index on yr and yday, or perhaps even a compound index on (yr, yday).

Query 1b: Retrieve the value of a single var for a portion of the cells for a single day. This is analogous to an image where every pixel is the value of a single var.

    SELECT FROM d WHERE yr = ? AND yday = ? AND cell_id IN (?,?,?...);

I am assuming I would need an index on yr and yday, or perhaps even a compound index on (yr, yday), AND an index on cell_id.

Query 2: Retrieve values for all the days or a duration of days for a single var for a single cell. This is like grabbing a column out of a table in which each row holds all the vars for a single day.

    SELECT FROM d WHERE cell_id = ?;
    SELECT FROM d WHERE cell_id IN (?,?,?...);

Once again, an index on cell_id would assist in the above.

The problem: The above table would have 13 M * 9125 rows ~ 118 billion rows. Huge indexes, slow queries, etc. In fact, major issues loading the data in the first place. Since I am loading data in batches, I drop the indexes (takes time), COPY data into the table (takes time), build the indexes (takes time), experiment with improving the performance (takes time), fail, rinse, lather, repeat. I actually tried the above with a subset of data (around 100 M rows) and experienced all of the above. I don't remember the query times, but they were awful.

So, I partitioned the table into years like so

    CREATE TABLE d_<yr> (
        CHECK ( yr = <yr> )
    )
    INHERITS (d)

Hmmm... still no satisfaction. I ended up with 1 master table + 25 inherited tables. Each of the year tables now had ~ 4.75 billion rows (13 M * 365), and the queries were still very slow.

So, I partitioned it all by years and days like so

    CREATE TABLE d_<yr>_<yday> (
        CHECK ( yr = <yr> AND yday = <yday> )
    )
    INHERITS (d)

Each table now has 13 million rows, and is reasonably fast (although still not satisfactorily fast), but now I have 9K tables. That has its own problems: I can't query the master table anymore, as Pg tries to lock all the tables and runs out of memory. Additionally, I can't conduct query two above anymore. I could do something like

    SELECT a FROM d_1980_1 WHERE cell_id = 1
    UNION
    SELECT a FROM d_1980_2 WHERE cell_id = 1
    UNION
    SELECT a FROM d_1980_3 WHERE cell_id = 1
    UNION
    SELECT a FROM d_1980_4 WHERE cell_id = 1
    UNION
    ...

But the above is hardly optimal.
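A middle ground between 25 huge partitions and 9,000 tiny ones would be to keep yearly partitions but carry one index per query shape. This is a sketch only, not tested against data at this scale, and the year and index names are illustrative:

```sql
-- One child table per year (~25 children, ~4.75 B rows each):
CREATE TABLE d_1980 (
    CHECK ( yr = 1980 )
) INHERITS (d);

-- Serves Query 1 (one day across cells): yr is pinned by the CHECK
-- constraint, so the index leads with yday; cell_id as the second
-- column also covers the IN-list variant:
CREATE INDEX d_1980_yday_cell_idx ON d_1980 (yday, cell_id);

-- Serves Query 2 (one cell across days):
CREATE INDEX d_1980_cell_yday_idx ON d_1980 (cell_id, yday);
```

With constraint exclusion enabled, a query against the parent filtering on yr touches only the matching child; whether indexes over ~4.75 B-row children stay fast enough in practice is exactly the open question of this thread.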
Any suggestions, ideas, brainstorms would be appreciated. Perhaps Pg, or even an RDBMS, is not the right tool for this problem, in which case suggestions for alternatives would be welcome as well. Right now I am testing this on a dual Xeon dual core 3 GHz Xserve with 12 GB RAM. The PGDATA directory is located on an attached RAID that is configured as RAID5. Reasonable time for a query w