Re: [PERFORM] How many tables is too many tables?
[EMAIL PROTECTED] wrote:
> I have a database of hundreds of millions of web links (between sites)
> in Postgres. For each link, we record the url, the referer, and the
> most recent date the link exists. I'm having some serious performance
> issues when it comes to writing new data into the database.
>
> One machine is simply not going to be able to scale with the quantities
> of links we hope to store information about and we want to move to some
> kind of cluster. Because of the quantities of data, it seems to make
> sense to go for a cluster setup such that in a 4 machine cluster, each
> machine has a quarter of the data (is this "Share nothing," or, "Share
> everything"?). To that end, we figured a good first step was to
> partition the data on one machine into multiple tables defining the
> logic which would find the appropriate table given a piece of data.
> Then, we assumed, adding the logic to find the appropriate machine and
> database in our cluster would only be an incremental upgrade.

In a database app, you generally don't win by going to a cluster, because you are almost always bound by your I/O. That means a single machine, just with more disks, is going to outperform a group of machines.

As Tom mentioned, your schema is not very good. So let's discuss what a better schema would be, and also how you might be able to get decent performance with a cluster.

First, 200 rows * 400,000 tables = 80M rows. Postgres can handle this in a single table without too much difficulty. It all depends on the selectivity of your indexes, etc.

I'm not sure how you are trying to normalize your data, but it sounds like you want a url table, so that each entry can be a simple integer rather than the full path, since you are likely to have a lot of repeated information. That makes your main table something like 2 integers plus the interesting stuff (from url, to url, date).
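A minimal sketch of that layout (the table and column names here are my own illustrative assumptions, not anything specified in the thread):

```sql
-- Hypothetical normalized layout: each distinct URL is stored once,
-- and links reference URLs by integer id.
CREATE TABLE url (
    url_id  serial PRIMARY KEY,
    url     text NOT NULL UNIQUE
);

CREATE TABLE link (
    from_url_id  integer NOT NULL REFERENCES url (url_id),  -- referring page
    to_url_id    integer NOT NULL REFERENCES url (url_id),  -- linked-to page
    last_seen    date NOT NULL,  -- most recent date the link existed
    PRIMARY KEY (from_url_id, to_url_id)
);
```

With 80M rows of two integers and a date, the main table stays narrow, and each repeated URL string is stored only once.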
If you find you are running into I/O problems, you could use this layout to move your indexes off onto their own spindles, and maybe separate the main table from the url tables.

What is your hardware? What are you trying to do that you don't think will scale?

If you were SELECT bound, then maybe a cluster would help you, because you could off-load the SELECTs onto slave machines and leave your primary machine available for INSERTs and replication.

> ...
>
> At this point, the primary performance bottleneck is in adding
> additional data to the database. Our loader program (we load text
> files of link information) is currently getting about 40 rows a second,
> which is nowhere near the performance we need to be seeing. In theory,
> we want to be able to re-write our entire archive of data within a
> 1-2 month cycle, so this is a very heavy write application (though
> we're also constantly generating reports from the data, so it's not
> write only).

Are you VACUUMing enough? If you are rewriting all of the data, postgres needs you to clean up afterwards. It is pessimistic, and leaves old rows in place.

> Is the total number of tables prohibitively affecting our write speed
> or is that an IO problem that can only be addressed by better drive
> partitioning (all data is on one drive, which I've already read is a
> problem)? Is this approach to data partitioning one which makes any
> sense for performance, or should we move to a more normal distribution
> of links across fewer tables which house more rows each?

If all data is on a single drive, you are nowhere near needing a cluster to improve your database. What you need is a 14-drive RAID array. It's probably cheaper than 4x powerful machines, and will provide you with much better performance. And put all of your tables back into one.

John
=:->

> Thanks in advance for your advice.
>
> -matt
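For the earlier point about moving indexes onto their own spindles: Postgres 8.0 and later supports this with tablespaces. A sketch, where the directory path and the table/index names are illustrative assumptions:

```sql
-- Hypothetical: place indexes on a second disk via a tablespace (8.0+).
CREATE TABLESPACE index_disk LOCATION '/mnt/disk2/pg_indexes';

-- An index on the main links table, kept on the separate spindle.
CREATE INDEX link_to_url_idx ON link (to_url_id) TABLESPACE index_disk;
```

This separates index I/O from heap I/O without any application changes, which is usually a much cheaper experiment than clustering.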
Re: [PERFORM] How many tables is too many tables?
"[EMAIL PROTECTED]" <[EMAIL PROTECTED]> writes:
> We currently have around 400,000 tables and I would estimate that the
> vast majority of these tables are relatively small (less than 200 rows).

Stop right there, and go redesign your schema. This is unbelievably wrong :-(

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your message
can get through to the mailing list cleanly
[PERFORM] How many tables is too many tables?
I have a database of hundreds of millions of web links (between sites) in Postgres. For each link, we record the url, the referer, and the most recent date the link exists. I'm having some serious performance issues when it comes to writing new data into the database.

One machine is simply not going to be able to scale with the quantities of links we hope to store information about and we want to move to some kind of cluster. Because of the quantities of data, it seems to make sense to go for a cluster setup such that in a 4 machine cluster, each machine has a quarter of the data (is this "Share nothing," or, "Share everything"?). To that end, we figured a good first step was to partition the data on one machine into multiple tables, defining the logic which would find the appropriate table given a piece of data. Then, we assumed, adding the logic to find the appropriate machine and database in our cluster would only be an incremental upgrade.

We implemented a partitioning scheme that segments the data by the referring domain of each link. This is clearly not the most regular (in terms of even distribution) means of partitioning, but the data in each table is most likely related to each other, so queries would hit the smallest number of tables. We currently have around 400,000 tables and I would estimate that the vast majority of these tables are relatively small (less than 200 rows).

Our queries use UNION ALL to combine data from multiple tables (when that's applicable, never more than 1000 tables at once, usually much fewer). When writing to the database, the table for the referring domain is locked while data is added and updated for the whole referring domain at once. We only store one copy of each link, so when loading we have to do a SELECT (for the date) then an INSERT or UPDATE, where applicable, for each link.

At this point, the primary performance bottleneck is in adding additional data to the database.
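The routing scheme described above can be sketched as follows (the function names, the table-name convention, and the use of CRC32 for the machine hash are all my own illustrative assumptions):

```python
import zlib
from urllib.parse import urlparse


def referring_domain(referer_url: str) -> str:
    """Partition key: the domain of the referring URL."""
    return urlparse(referer_url).netloc.lower()


def table_for(referer_url: str) -> str:
    """One table per referring domain, as in the scheme described above."""
    domain = referring_domain(referer_url)
    return "links_" + domain.replace(".", "_").replace("-", "_")


def machine_for(referer_url: str, n_machines: int = 4) -> int:
    """Hypothetical next step: hash the same key to one of N machines,
    so routing to a cluster node becomes an incremental change."""
    key = referring_domain(referer_url).encode("utf-8")
    return zlib.crc32(key) % n_machines
```

Because both functions key on the same domain, all links from one referring domain land in one table on one machine, which matches the goal of queries hitting the smallest number of tables.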
Our loader program (we load text files of link information) is currently getting about 40 rows a second, which is nowhere near the performance we need to be seeing. In theory, we want to be able to re-write our entire archive of data within a 1-2 month cycle, so this is a very heavy write application (though we're also constantly generating reports from the data, so it's not write only).

Is the total number of tables prohibitively affecting our write speed, or is that an IO problem that can only be addressed by better drive partitioning (all data is on one drive, which I've already read is a problem)? Is this approach to data partitioning one which makes any sense for performance, or should we move to a more normal distribution of links across fewer tables which house more rows each?

Thanks in advance for your advice.

-matt
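A quick sanity check on those numbers (the 200M-row archive size is an assumption based on "hundreds of millions of web links"):

```python
# Back-of-the-envelope: how long a full rewrite takes at the observed rate.
rows_total = 200_000_000      # assumed archive size
rows_per_second = 40          # observed loader throughput

days = rows_total / rows_per_second / 86_400   # 86,400 seconds per day
print(f"Full rewrite at {rows_per_second} rows/s: about {days:.0f} days")
```

At roughly 58 days, the observed rate barely fits the stated 1-2 month cycle even before the archive grows, which is why the write path is the thing to fix.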
Re: [PERFORM] How many tables is too many tables?
<[EMAIL PROTECTED]> wrote
>
> One machine is simply not going to be able to scale with the quantities
> of links we hope to store information about and we want to move to some
> kind of cluster. Because of the quantities of data, it seems to make
> sense to go for a cluster setup such that in a 4 machine cluster, each
> machine has a quarter of the data (is this "Share nothing," or, "Share
> everything"?). To that end, we figured a good first step was to
> partition the data on one machine into multiple tables defining the
> logic which would find the appropriate table given a piece of data.
> Then, we assumed, adding the logic to find the appropriate machine and
> database in our cluster would only be an incremental upgrade.
>

So you set up 4 separate copies of PG on 4 machines? This is neither SN nor SE. The partitioning is good for performance if you distribute IOs and CPUs. In your design, I believe IO is distributed (to 4 machines), but since you sliced the data into such small pieces, you will pay a penalty in other places. For example, each table has to maintain separate indices (an index becomes a useless burden when the table is too small), so there will be that many B-tree roots. The system tables (pg_class/pg_attribute, etc.) have to contain many rows to record all of your tables; system table rows are cached, but that memory space is limited. In short: too many tables. To design your new partitioning method, just keep in mind that the database accesses data with page-wise IO.

Regards,
Qingqing
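The per-table overhead argument can be made concrete with rough arithmetic (the page and column counts below are my own assumptions for illustration, not measured Postgres figures):

```python
# Fixed per-table overhead with 400,000 tiny tables.
PAGE = 8 * 1024        # Postgres default page size: 8 KB
n_tables = 400_000

# Assume every table occupies at least 1 heap page and 1 index root page,
# no matter how few rows it actually holds.
min_footprint_gib = n_tables * 2 * PAGE / 2**30
print(f"minimum on-disk footprint: {min_footprint_gib:.1f} GiB")

# Catalog bookkeeping: assume ~2 pg_class rows (table + index) and
# ~5 pg_attribute rows (one per column) per table.
catalog_rows = n_tables * (2 + 5)
print(f"system catalog rows: {catalog_rows:,}")
```

Several GiB of mostly-empty pages and millions of catalog rows are pure overhead that a single large table avoids, which is the page-wise-IO point above in numbers.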