Re: [pmacct-discussion] PostgreSQL performance
Hi all,

I want to share with you an interesting experience I just had. I did the following SELECT:

SELECT ip_dst, SUM(bytes), SUM(packets), SUM(flows) FROM tcom_v5_20060530
  WHERE stamp_inserted >= '2006-05-30 00:00:00'
    AND stamp_inserted < '2006-05-30 02:00:00'
    AND port_src = '53' AND port_dst = '53' AND ip_src = 'xxx.xxx.xxx.xxx'
  GROUP BY ip_dst ORDER BY SUM(bytes) DESC LIMIT 10;

which seemed to never end (30 minutes). I had only two indexes on this table:

tcom_v5_20060530_idx btree (ip_src, ip_dst, stamp_inserted)
tcom_v5_20060530_stamp_idx btree (stamp_inserted)

After dropping tcom_v5_20060530_idx, the query was answered in 20 seconds.

Cheers, Sven

--
Sven Anderson
Institute for Informatics - http://www.ifi.informatik.uni-goettingen.de
Georg-August-Universitaet Goettingen
Lotzestr. 16-18, 37083 Goettingen, Germany

___
pmacct-discussion mailing list
http://www.pmacct.net/#mailinglists
Re: [pmacct-discussion] PostgreSQL performance
On Wed 31 May 2006 19:23, Sven Anderson wrote:
> Hi all, I want to share with you an interesting experience I just had.
> I did the following SELECT:
>
> SELECT ip_dst, SUM(bytes), SUM(packets), SUM(flows) FROM tcom_v5_20060530
>   WHERE stamp_inserted >= '2006-05-30 00:00:00'
>     AND stamp_inserted < '2006-05-30 02:00:00'
>     AND port_src = '53' AND port_dst = '53' AND ip_src = 'xxx.xxx.xxx.xxx'
>   GROUP BY ip_dst ORDER BY SUM(bytes) DESC LIMIT 10;
>
> which seemed to never end (30 minutes). I had only two indexes on this table:
>
> tcom_v5_20060530_idx btree (ip_src, ip_dst, stamp_inserted)
> tcom_v5_20060530_stamp_idx btree (stamp_inserted)
>
> After dropping tcom_v5_20060530_idx, the query was answered in 20 seconds.

When is the last time you did a VACUUM ANALYZE?

--
Peter Nixon
http://www.peternixon.net/
PGP Key: http://www.peternixon.net/public.asc
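[Editor's note: Peter's suggestion can be tried as below; this is a sketch using the table name from the thread, and the plan output will of course differ per installation.]

```sql
-- Refresh the planner's statistics for the table, then inspect the plan.
VACUUM ANALYZE tcom_v5_20060530;

EXPLAIN ANALYZE
SELECT ip_dst, SUM(bytes), SUM(packets), SUM(flows)
  FROM tcom_v5_20060530
 WHERE stamp_inserted >= '2006-05-30 00:00:00'
   AND stamp_inserted <  '2006-05-30 02:00:00'
   AND port_src = '53' AND port_dst = '53'
 GROUP BY ip_dst
 ORDER BY SUM(bytes) DESC
 LIMIT 10;
-- With stale statistics the planner may pick the (ip_src, ip_dst,
-- stamp_inserted) index even when the stamp_inserted index is far cheaper.
```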
Re: [pmacct-discussion] PostgreSQL performance
Hi Wim,

Wim Kerkhoff, 04.05.2006 06:09:
> Disabling fsync() does improve performance. I did this once for a pmacct
> database, but was bitten horribly after the server was rebooted a couple of
> times without being shut down properly (power/UPS failure etc). So don't do
> that.

You mean: all the data was gone, not only the recent data?

> - enable optimal indexes based on queries being used by pmacctd to update
>   the tables

I disabled updates altogether. But anyway, with the default primary key index over all the update fields there was no chance. Now I already have over 18 million rows with these indexes:

Indexes:
    acct_v5_idx btree (ip_src, ip_dst, stamp_inserted)
    stamp_inserted_idx btree (stamp_inserted)

and all the queries are fast enough as long as the selected time window is not too big. But even adding a hash index for ip_src was creating trouble. So my problem is definitely correlated with too many indexes.

> - remove unused indexes

I would add: or not-so-important indexes. ;-)

> - run all SELECT/UPDATE statements under EXPLAIN ANALYZE to see if they are
>   actually hitting the indexes

Good hint.

> > BTW.: What is the best indexing to make a select with
> > WHERE stamp_inserted >= '2006-04-21 20:17:55'
> >   AND stamp_inserted < '2006-04-21 21:17:55'
> > fast? I guess a btree index, but maybe there's something better? This
> > question is still open. No DBMS gurus here? ;-)
>
> Are you actually doing queries like this, or is this just an example?

Yes, I'm doing grouped sums over certain time slots, where I use queries like this:

SELECT ip_src, SUM(bytes) FROM acct_v5
  WHERE stamp_inserted >= '2006-04-21 20:17:55'
    AND stamp_inserted < '2006-04-21 21:17:55'
  GROUP BY ip_src ORDER BY SUM(bytes) DESC;

> What you could do is an index based on a group clause. Eg, group by
> date_trunc('hour', stamp_inserted). Then it will be very fast to find the
> records from the hour. Then a sequential scan for records in those selected
> hours should be pretty quick.

Will it also intelligently use this index when doing WHERE stamp_inserted >= '2006-04-21 20:17:55' AND stamp_inserted < '2006-04-21 21:17:55', so that it looks only in the two touched hours?

> I have counter tables with 7+ million records in Pg, and no complaints at
> all from the people hitting web apps that are hitting them. Therefore I
> created acct_monthly, acct_daily, and acct_hourly tables. acct_monthly is
> blazing fast because it has barely any records in it. acct_daily has far
> fewer records than acct_hourly.

Sounds like the different-resolution timelines that RRDTool creates.

Cheers, Sven
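[Editor's note: a minimal sketch of Wim's suggestion. Expression indexes are standard PostgreSQL; the index name here is hypothetical.]

```sql
-- Index on the truncated hour; queries that filter or group on the same
-- expression can use it.
CREATE INDEX acct_v5_hour_idx
    ON acct_v5 (date_trunc('hour', stamp_inserted));

-- This query matches the indexed expression exactly, so it can use the index:
SELECT ip_src, SUM(bytes)
  FROM acct_v5
 WHERE date_trunc('hour', stamp_inserted) = '2006-04-21 20:00:00'
 GROUP BY ip_src
 ORDER BY SUM(bytes) DESC;
```

To Sven's follow-up question: a plain range predicate on stamp_inserted itself will not use this expression index; for arbitrary ranges, the plain btree on stamp_inserted remains the right tool.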
Re: [pmacct-discussion] PostgreSQL performance
Hi Paolo,

Paolo Lucente, 01.05.2006 17:20:
> > it seems to be known that a default MySQL performs a lot better than a
> > default PostgreSQL. One difference is that PostgreSQL is doing a COMMIT
>
> Not to open a possible flame, but i fully agree with this view. However, we
> massively deal with network objects (IP and MAC addresses, network subnets)
> which are handled properly only by PostgreSQL. Do you see chances that this
> can re-open the challenge on this scenario ?

What do you mean by "properly"? Basically these types are just numbers, and those are handled properly by MySQL too. Sure, it doesn't look as nice in a raw table. The problem in the MySQL schema for me is the strings, especially in ip_proto. To me it looks like PostgreSQL is a great thing if you have complex table relations. But we have plain flat single tables. Maybe even MySQL is overhead here. ;-)

Did you take a closer look at flowd (http://www.mindrot.org/flowd.html)? It has its own binary file format for flow data. Maybe this is the right direction? My dream is anyway a backend-independent API for accessing the data, where you don't have to know where and in which format the data is stored. ;-)

Cheers, Sven
Re: [pmacct-discussion] PostgreSQL performance
Sven Anderson wrote:
> Hi all,
>
> Sven Anderson, 21.04.2006 21:34:
> > I think the problem is not the updating of the data itself, but updating
> > the complex primary key. An index of (ip_src, ip_dst, stamp_inserted) is
> > fast enough to find entries, and easy enough to maintain.
>
> it seems to be known that a default MySQL performs a lot better than a
> default PostgreSQL. One difference is that PostgreSQL does a COMMIT after
> each INSERT/UPDATE, which results in an fsync() call if fsync = true in
> postgresql.conf. See this article:
> http://monstera.man.poznan.pl/wiki/index.php/Mysql_vs_postgres

Disabling fsync() does improve performance. I did this once for a pmacct database, but was bitten horribly after the server was rebooted a couple of times without being shut down properly (power/UPS failure etc). So don't do that.

When I was doing PostgreSQL optimization, I put some links on that topic in: http://www.nyetwork.org/wiki/PostgreSQL

There are some simple things that can be done in PostgreSQL to really improve performance. My main tips are:

- only keep columns in the pmacct tables that are actually being used
- check datatypes to see if they can be tightened up
- enable optimal indexes based on the queries used by pmacctd to update the tables
- enable optimal indexes based on the queries used by SQL reports (eg web pages) to select data
- remove unused indexes
- run all SELECT/UPDATE statements under EXPLAIN ANALYZE to see if they are actually hitting the indexes
- sometimes running "set enable_seqscan=no;" before a SELECT query can make a big difference

> BTW.: What is the best indexing to make a select with
> WHERE stamp_inserted >= '2006-04-21 20:17:55'
>   AND stamp_inserted < '2006-04-21 21:17:55'
> fast? I guess a btree index, but maybe there's something better? This
> question is still open. No DBMS gurus here? ;-)

Are you actually doing queries like this, or is this just an example?

What you could do is an index based on a group clause. Eg, group by date_trunc('hour', stamp_inserted). Then it will be very fast to find the records from the hour. Then a sequential scan for records in those selected hours should be pretty quick.

I have counter tables with 7+ million records in Pg, and no complaints at all from the people hitting the web apps that query them.

Having one big table for everything of course makes queries easier. But often people are only doing hour-to-hour queries within the same day. And generally I found that most reports people run are per-month totals by IP address. Therefore I created acct_monthly, acct_daily, and acct_hourly tables. acct_monthly is blazing fast because it has barely any records in it. acct_daily has far fewer records than acct_hourly (although more than acct_monthly), yet finding totals per day is fast since it has indexes that group based on that date_part() function's results.

This method uses some extra disk space and causes some extra CPU load because there are three times as many pmacctd processes running, but overall it's worth it since queries are always fast, and pmacctd is never backlogged frantically trying to commit...

Wim
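[Editor's note: a minimal sketch of the multi-resolution setup Wim describes. Table, column, and index names are hypothetical; in pmacct each resolution would typically be fed by its own plugin instance.]

```sql
-- One table per resolution; each is populated independently.
CREATE TABLE acct_hourly  (ip_src inet, bytes bigint, stamp_inserted timestamp);
CREATE TABLE acct_daily   (ip_src inet, bytes bigint, stamp_inserted timestamp);
CREATE TABLE acct_monthly (ip_src inet, bytes bigint, stamp_inserted timestamp);

-- Expression indexes so per-day / per-month totals by IP group cheaply.
CREATE INDEX acct_daily_day_idx
    ON acct_daily (date_trunc('day', stamp_inserted), ip_src);
CREATE INDEX acct_monthly_month_idx
    ON acct_monthly (date_trunc('month', stamp_inserted), ip_src);
```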
Re: [pmacct-discussion] PostgreSQL performance
Hey Sven,

On Fri, Apr 28, 2006 at 02:31:43PM +0200, Sven Anderson wrote:
> > > it seems to be known that a default MySQL performs a lot better than a
> > > default PostgreSQL. One difference is that PostgreSQL is doing a COMMIT
> >
> > Not to open a possible flame, but i fully agree with this view. However,
> > we massively deal with network objects (IP and MAC addresses, network
> > subnets) which are handled properly only by PostgreSQL. Do you see
> > chances that this can re-open the challenge on this scenario ?
>
> Paolo, are you using transactions in the PostgreSQL plugin, to COMMIT all
> UPDATEs in one bunch? I couldn't clearly see it in the source code.

Sure! Macro-transactions have been in use for ages (both in the PostgreSQL and SQLite plugins -- grep COMMIT src/*). The approach of the very first versions of the plugin was to use micro-transactions, but intuitively it was performing horrendously compared to the current approach.

Cheers, Paolo
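[Editor's note: the difference Paolo describes, in sketch form. The statements and placeholder values are illustrative only, not the plugin's actual SQL.]

```sql
-- Micro-transactions: each statement is its own implicit transaction, so
-- every row costs a COMMIT (and, with fsync = true, a disk flush).
INSERT INTO acct_v5 (ip_src, bytes) VALUES ('10.0.0.1', 100);
INSERT INTO acct_v5 (ip_src, bytes) VALUES ('10.0.0.2', 200);

-- Macro-transaction: one COMMIT amortizes the fsync() over the whole batch.
BEGIN;
INSERT INTO acct_v5 (ip_src, bytes) VALUES ('10.0.0.1', 100);
INSERT INTO acct_v5 (ip_src, bytes) VALUES ('10.0.0.2', 200);
COMMIT;
```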
Re: [pmacct-discussion] PostgreSQL performance
Hi all,

Sven Anderson, 21.04.2006 21:34:
> I think the problem is not the updating of the data itself, but updating
> the complex primary key. An index of (ip_src, ip_dst, stamp_inserted) is
> fast enough to find entries, and easy enough to maintain.

it seems to be known that a default MySQL performs a lot better than a default PostgreSQL. One difference is that PostgreSQL does a COMMIT after each INSERT/UPDATE, which results in an fsync() call if fsync = true in postgresql.conf. See this article: http://monstera.man.poznan.pl/wiki/index.php/Mysql_vs_postgres

Paolo, are you using transactions in the PostgreSQL plugin, to COMMIT all UPDATEs in one bunch? I couldn't clearly see it in the source code.

BTW.: What is the best indexing to make a select with WHERE stamp_inserted >= '2006-04-21 20:17:55' AND stamp_inserted < '2006-04-21 21:17:55' fast? I guess a btree index, but maybe there's something better? This question is still open. No DBMS gurus here? ;-)

Cheers, Sven
Re: [pmacct-discussion] PostgreSQL performance
Hi Paolo and all,

Paolo Lucente, 10.04.2006 18:42:
> digging through the SQL scripts i've just noticed something bad: primary
> keys of default PostgreSQL tables v2-v5 are just missing the 'vlan' field
> (while it correctly appears in equivalent MySQL/SQLite 3.x schemas). This
> may explain the slowness. To verify this, can you please modify the script,
> place the 'vlan' field inside the primary key (between mac_dst and ip_src),
> create the default table again and see whether everything now works
> correctly ? Surprised that we have not caught this before. Let me know.

I created a new table with this index:

acct_pk primary key, btree (agent_id, class_id, mac_src, mac_dst, vlan, ip_src, ip_dst, port_src, port_dst, ip_proto, tos, stamp_inserted)

and I still have the problem. There is always a long-lasting postmaster process:

8115 ?? D 0:02.91 postmaster: pmacct pmacct [local] UPDATE (postgres)

which blocks the other DB writer processes. My other table with these indexes:

acct_v5_idx btree (ip_src, ip_dst, stamp_inserted)
stamp_inserted_idx hash (stamp_inserted)

still works fine with 14731769 rows. I think the problem is not the updating of the data itself, but updating the complex primary key. An index of (ip_src, ip_dst, stamp_inserted) is fast enough to find entries, and easy enough to maintain.

BTW.: What is the best indexing to make a select with WHERE stamp_inserted >= '2006-04-21 20:17:55' AND stamp_inserted < '2006-04-21 21:17:55' fast? I guess a btree index, but maybe there's something better?

Cheers, Sven
Re: [pmacct-discussion] PostgreSQL performance
This one time, at band camp, Sven Anderson wrote:
> Hi, i'm just testing pmacct on OpenBSD 3.7. First I used MySQL as data
> backend. Even after several days of capturing into the same table I had no
> performance problems. But since the MySQL backend uses a string for
> ip_proto and has no IP address type, I decided to switch to PostgreSQL.
> But now I have massive performance problems. After one day, when there are
> 500,000 to 700,000 rows in the table, it gets so slow that it cannot store
> the data fast enough any more, resulting in a lot of processes like
>
> 22758 ?? I 0:00.00 pmacctd: PostgreSQL Plugin -- DB Writer [default] (pmacctd)
>
> and
>
> 20427 ?? I 0:00.03 postmaster: pmacct pmacct [local] LOCK TABLE waiting (postgres)
>
> Is there a tuning problem, or is PostgreSQL known to be not as fast as
> MySQL? I thought maybe there is some kind of rollback journal being
> written, which MySQL doesn't write, or the hash-tables are not indexed
> correctly? Any ideas?

What version of pmacct are you using? 0.10.0 has patches to remove the LOCK TABLE if you're in insert-only mode, which I recommend. I also remove the index from the table to speed up the inserts, and partition the data into a new table per day so that each table never grows too big to manage.
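[Editor's note: a hedged sketch of that insert-only, table-per-day setup as a pmacct configuration fragment. The directive names are from pmacct's configuration-key family; the table name pattern and timer values are illustrative assumptions.]

```
! Insert-only mode: the plugin only INSERTs, so no UPDATE and no LOCK TABLE.
sql_dont_try_update: true
! strftime-style variables in the table name give one table per day.
sql_table: acct_v5_%Y%m%d
! How finely rows are bucketed, and how often the cache is flushed to the DB.
sql_history: 1h
sql_refresh_time: 300
```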
Re: [pmacct-discussion] PostgreSQL performance
Hi Aaron and all,

thank you for your reply.

Aaron Glenn, 08.04.2006 01:50:
> It could be a myriad of issues. Have you changed any of the default
> PostgreSQL settings? PostgreSQL, by default, is not configured for anything
> but pure compatibility - meaning it will certainly work, but not at its
> best. For what it's worth, I had PostgreSQL and pmacct on an OpenBSD 3.8
> box and experienced no problems past the 1.2 million row mark (I shut down
> that installation for unrelated reasons). If you are not familiar with
> PostgreSQL - definitely start reading here[1]

I changed the values of the kernel and of PostgreSQL itself as proposed in README.OpenBSD and INSTALL. But most of that stuff seems to relate to the number of concurrent connections, which is not the problem.

Now I removed the primary key constraint of acct_v5:

CONSTRAINT acct_uni_pk PRIMARY KEY (agent_id, class_id, mac_src, mac_dst, ip_src, ip_dst, port_src, port_dst, ip_proto, tos, stamp_inserted)

and replaced it with two indexes:

acct_v5_idx btree (ip_src, ip_dst, stamp_inserted)
stamp_inserted_idx hash (stamp_inserted)

Now I've got 2.6 million rows over the weekend, and it works ok so far. So maintaining the complex primary key index seems to make a big difference. How long did it take for the 1.2 million rows in your setup? If the rows are collected slowly enough, then updating the table is of course no problem. My machine is quite weak, with a 900MHz PIII CPU, 320MB RAM and a normal IDE hard disk. But with MySQL it was no problem, so I'm just comparing.

Cheers, Sven
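[Editor's note: the schema change Sven describes, written out as DDL. Constraint and index names are the ones quoted in the thread, assuming the stock acct_v5 table.]

```sql
-- Drop the wide composite primary key; maintaining it on every UPDATE was
-- the bottleneck.
ALTER TABLE acct_v5 DROP CONSTRAINT acct_uni_pk;

-- Replace it with two narrower indexes tuned to the actual queries.
CREATE INDEX acct_v5_idx ON acct_v5 USING btree (ip_src, ip_dst, stamp_inserted);
CREATE INDEX stamp_inserted_idx ON acct_v5 USING hash (stamp_inserted);
```

Note that a hash index only helps equality lookups on stamp_inserted; for the range queries discussed elsewhere in the thread, a btree on stamp_inserted is the better choice (and later messages in the thread do switch to btree).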
Re: [pmacct-discussion] PostgreSQL performance
Hey Sven,

digging through the SQL scripts i've just noticed something bad: primary keys of default PostgreSQL tables v2-v5 are just missing the 'vlan' field (while it correctly appears in the equivalent MySQL/SQLite 3.x schemas). This may explain the slowness.

To verify this, can you please modify the script, place the 'vlan' field inside the primary key (between mac_dst and ip_src), create the default table again and see whether everything now works correctly? Surprised that we have not caught this before. Let me know.

Cheers, Paolo
Re: [pmacct-discussion] PostgreSQL performance
On 4/7/06, Sven Anderson wrote:
> Is there a tuning problem, or is PostgreSQL known to be not as fast as
> MySQL? I thought maybe there is some kind of rollback journal being
> written, which MySQL doesn't write, or the hash-tables are not indexed
> correctly? Any ideas?

It could be a myriad of issues. Have you changed any of the default PostgreSQL settings? PostgreSQL, by default, is not configured for anything but pure compatibility - meaning it will certainly work, but not at its best. For what it's worth, I had PostgreSQL and pmacct on an OpenBSD 3.8 box and experienced no problems past the 1.2 million row mark (I shut down that installation for unrelated reasons). If you are not familiar with PostgreSQL - definitely start reading here[1]

regards, aaron.glenn

[1] http://www.postgresql.org/docs/8.1/interactive/runtime-config.html
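[Editor's note: for readers wanting a starting point, a hedged sketch of the kind of 8.x-era settings that runtime-config chapter covers. The values are illustrative for a small box like the one described in this thread, not recommendations.]

```
# postgresql.conf fragment -- illustrative values only; tune to your RAM.
shared_buffers = 10000          # ~80MB in 8KB pages; the default is far lower
work_mem = 8192                 # KB available per sort/hash operation
checkpoint_segments = 8         # fewer, larger checkpoints for insert-heavy loads
effective_cache_size = 20000    # planner hint: OS cache size in 8KB pages
fsync = on                      # keep on; see the data-loss warning in this thread
```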