Re: [pmacct-discussion] PostgreSQL performance

2006-05-31 Thread Sven Anderson
Hi all,

I want to share with you an interesting experience I just had.

I did the following SELECT:

SELECT ip_dst,SUM(bytes),SUM(packets),SUM(flows) FROM
tcom_v5_20060530 WHERE stamp_inserted >= '2006-05-30 00:00:00' AND
stamp_inserted < '2006-05-30 02:00:00' AND port_src='53' AND
port_dst='53' AND ip_src='xxx.xxx.xxx.xxx' GROUP BY ip_dst ORDER BY
SUM(bytes) DESC LIMIT '10';

which seemed to never end (30 minutes).

I had only two indexes on this table:
tcom_v5_20060530_idx btree (ip_src, ip_dst, stamp_inserted)
tcom_v5_20060530_stamp_idx btree (stamp_inserted)

After dropping the tcom_v5_20060530_idx, the query was answered in 20 seconds.

?
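For anyone hitting the same surprise: comparing the plan before and after dropping the index makes the cause visible. A sketch using the query from above (PostgreSQL's EXPLAIN ANALYZE prints the chosen plan with actual row counts and timings):

```sql
-- Shows which index the planner chose and where the time actually went;
-- run it both with and without tcom_v5_20060530_idx to compare plans.
EXPLAIN ANALYZE
SELECT ip_dst, SUM(bytes), SUM(packets), SUM(flows)
  FROM tcom_v5_20060530
 WHERE stamp_inserted >= '2006-05-30 00:00:00'
   AND stamp_inserted <  '2006-05-30 02:00:00'
   AND port_src = '53' AND port_dst = '53'
   AND ip_src = 'xxx.xxx.xxx.xxx'
 GROUP BY ip_dst
 ORDER BY SUM(bytes) DESC
 LIMIT 10;
```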


Cheers,

Sven

-- 
Sven Anderson
Institute for Informatics - http://www.ifi.informatik.uni-goettingen.de
Georg-August-Universitaet Goettingen
Lotzestr. 16-18, 37083 Goettingen, Germany

___
pmacct-discussion mailing list
http://www.pmacct.net/#mailinglists


Re: [pmacct-discussion] PostgreSQL performance

2006-05-31 Thread Peter Nixon
On Wed 31 May 2006 19:23, Sven Anderson wrote:
 Hi all,

 I want to share with you an interesting experience I just had.

 I did the following SELECT:

 SELECT ip_dst,SUM(bytes),SUM(packets),SUM(flows) FROM
 tcom_v5_20060530 WHERE stamp_inserted >= '2006-05-30 00:00:00' AND
 stamp_inserted < '2006-05-30 02:00:00' AND port_src='53' AND
 port_dst='53' AND ip_src='xxx.xxx.xxx.xxx' GROUP BY ip_dst ORDER BY
 SUM(bytes) DESC LIMIT '10';

 which seemed to never end (30 minutes).

 I had only two indexes on this table:
 tcom_v5_20060530_idx btree (ip_src, ip_dst, stamp_inserted)
 tcom_v5_20060530_stamp_idx btree (stamp_inserted)

 After dropping the tcom_v5_20060530_idx, the query was answered in 20
 seconds.

When is the last time you did a VACUUM ANALYZE?
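For completeness, such a maintenance pass on the table from the thread is a one-liner:

```sql
-- Reclaims dead row versions left behind by UPDATEs and refreshes the
-- planner's statistics; stale statistics can make it pick a bad index.
VACUUM ANALYZE tcom_v5_20060530;
```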

-- 

Peter Nixon
http://www.peternixon.net/
PGP Key: http://www.peternixon.net/public.asc



Re: [pmacct-discussion] PostgreSQL performance

2006-05-12 Thread Sven Anderson
Hi Wim,

Wim Kerkhoff, 04.05.2006 06:09:
 Disabling fsync() does improve performance. I did this once for a Pmacct 
 database, but was bitten horribly after the server was rebooted a couple 
 of times without being shutdown properly (power/UPS failure etc). So 
 don't do that.

You mean: all the data was gone, not only the recent one?

 - enable optimal indexes based on queries being used by pmacctd to 
 update the tables

I disabled updates altogether. But anyway, with the default primary
key index over all the update fields there was no chance. Now I already
have over 18 million rows with these indexes:

Indexes:
acct_v5_idx btree (ip_src, ip_dst, stamp_inserted)
stamp_inserted_idx btree (stamp_inserted)

and all the queries are fast enough as long as the selected time window is
not too big. But even adding a hash index on ip_src was creating
trouble. So my problem is definitely correlated with too many indexes.

 - remove unused indexes

I would add: or not-so-important indexes. ;-)

 - run all SELECT & UPDATE queries under "explain analyze ..." to see if they
 are actually hitting the indexes

Good hint.

 BTW.: What is the best indexing to make a select with WHERE
 stamp_inserted >= '2006-04-21 20:17:55' AND stamp_inserted < '2006-04-21
 21:17:55' fast? I guess a btree index, but maybe there's something better?
 
 This question is still open. No DBMS gurus here? ;-)
   
 Are you actually doing queries like this, or is this just an example?

Yes, I'm doing grouped sums over certain time slots, using queries like
this:
SELECT ip_src, sum(bytes) FROM acct_v5
WHERE stamp_inserted >= '2006-04-21 20:17:55'
AND stamp_inserted < '2006-04-21 21:17:55'
GROUP BY ip_src ORDER BY sum(bytes) DESC;

 What you could do is an index based on a GROUP BY clause, e.g. grouping by
 date_trunc('hour', stamp_inserted). Then it will be very fast to find
 the records for a given hour, and a sequential scan over the records in
 those selected hours should be pretty quick.

Will it also intelligently use this index when doing WHERE
stamp_inserted >= '2006-04-21 20:17:55' AND stamp_inserted < '2006-04-21
21:17:55', so that it only looks at the two touched hours?

 I have counter tables with 7+ million records in Pg, and no complaints 
 at all from the people hitting web apps that are hitting them.

 Therefore I created acct_monthly,
 acct_daily, and acct_hourly tables. acct_monthly is blazing fast because
 it has barely any records in it. acct_daily has far fewer records than
 acct_hourly.

Sounds like the multi-resolution time series that are typically
created with RRDTool.


Cheers,

Sven

-- 
Sven Anderson
Institute for Informatics - http://www.ifi.informatik.uni-goettingen.de
Georg-August-Universitaet Goettingen
Lotzestr. 16-18, 37083 Goettingen, Germany



Re: [pmacct-discussion] PostgreSQL performance

2006-05-12 Thread Sven Anderson
Hi Paolo,

Paolo Lucente, 01.05.2006 17:20:
 it seems to be known that a default MySQL performs a lot better than a
 default PostgreSQL. One difference is that PostgreSQL is doing a COMMIT
 
 Not to open a possible flame, but I fully agree with this view. However, we
 massively deal with network objects (IP and MAC addresses, network subnets),
 which are handled properly only by PostgreSQL. Do you see chances that this
 can re-open the challenge on this scenario?

what do you mean by "properly"? Basically these types are just numbers,
and those are handled properly by MySQL too. Sure, it doesn't look as nice
in a raw table. For me the problem in the MySQL schema is the strings,
especially in ip_proto.

To me it looks like PostgreSQL is the cool thing if you have complex table
relations. But we have plain flat single tables. Maybe even MySQL is
overhead here. ;-) Did you take a closer look at flowd
(http://www.mindrot.org/flowd.html)? It has its own binary file format
for flow data. Maybe that is the right direction? In any case, my dream is
a backend-independent API for accessing the data, where you don't have to
know where and in which format the data is stored. ;-)


Cheers,

Sven

-- 
Sven Anderson
Institute for Informatics - http://www.ifi.informatik.uni-goettingen.de
Georg-August-Universitaet Goettingen
Lotzestr. 16-18, 37083 Goettingen, Germany



Re: [pmacct-discussion] PostgreSQL performance

2006-05-03 Thread Wim Kerkhoff
Sven Anderson wrote:
 Hi all,

 Sven Anderson, 21.04.2006 21:34:
   
 I think the problem is not the updating of the data itself, but updating
 the complex primary key. An index of (ip_src, ip_dst, stamp_inserted) is
 fast enough to find entries, and easy enough to maintain.
 

 it seems to be known that a default MySQL performs a lot better than a
 default PostgreSQL. One difference is that PostgreSQL does a COMMIT
 after each INSERT/UPDATE, which results in an fsync() call if fsync =
 true in postgresql.conf. See this article:
 http://monstera.man.poznan.pl/wiki/index.php/Mysql_vs_postgres
   
Disabling fsync() does improve performance. I did this once for a pmacct 
database, but was bitten horribly after the server was rebooted a couple 
of times without being shut down properly (power/UPS failure etc). So 
don't do that.

When I was doing PostgreSQL optimization, I put some links on that topic 
in:
http://www.nyetwork.org/wiki/PostgreSQL

There are some simple things that can be done in PostgreSQL to really 
improve performance.

My main tips are:

- only keep columns in the pmacct tables that are actually being used
- check datatypes to see if they can be tightened up
- enable optimal indexes based on queries being used by pmacctd to 
update the tables
- enable optimal indexes based on queries being used by SQL reports (eg 
web pages) to select data
- remove unused indexes
- run all SELECT & UPDATE queries under "explain analyze ..." to see if they 
are actually hitting the indexes
- sometimes running "set enable_seqscan=no;" before a SELECT query can make 
a big difference
 BTW.: What is the best indexing to make a select with WHERE
 stamp_inserted >= '2006-04-21 20:17:55' AND stamp_inserted < '2006-04-21
 21:17:55' fast? I guess a btree index, but maybe there's something better?
 

 This question is still open. No DBMS gurus here? ;-)
   
Are you actually doing queries like this, or is this just an example?

What you could do is an index based on a GROUP BY clause, e.g. grouping by 
date_trunc('hour', stamp_inserted). Then it will be very fast to find 
the records for a given hour, and a sequential scan over the records in 
those selected hours should be pretty quick.
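A sketch of that suggestion as an expression index (hypothetical index name; note that PostgreSQL will only consider it when the query repeats the same date_trunc() expression, so a plain range WHERE on stamp_inserted will not use it):

```sql
CREATE INDEX acct_v5_hour_idx
    ON acct_v5 (date_trunc('hour', stamp_inserted));

-- A query that can use the index repeats the indexed expression:
SELECT ip_src, sum(bytes)
  FROM acct_v5
 WHERE date_trunc('hour', stamp_inserted) = '2006-04-21 20:00:00'
 GROUP BY ip_src;
```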

I have counter tables with 7+ million records in Pg, and no complaints 
at all from the people hitting web apps that are hitting them.

Having one big table for everything of course makes queries easier. But 
often people are only doing hour-to-hour queries within the same day, 
and generally I found that most reports people run are for per-month 
totals by IP address. Therefore I created acct_monthly, acct_daily, and 
acct_hourly tables. acct_monthly is blazing fast because it has barely 
any records in it. acct_daily has far fewer records than acct_hourly; 
although it does have more than acct_monthly, finding totals per day is 
fast since it has indexes that group based on that date_part() function's 
results. This method uses some extra disk space and causes some extra CPU 
load because there are three times as many pmacctd processes running, but 
overall it's worth it since queries are always fast, and pmacctd is never 
backlogged trying frantically to commit...
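One possible sketch of how such a daily rollup can be filled from the hourly table (the column names are assumptions based on the pmacct v5 schema discussed in this thread; Wim's setup feeds the tables from separate pmacctd processes instead):

```sql
-- Aggregate one finished day of hourly rows into the daily table.
INSERT INTO acct_daily (ip_src, ip_dst, bytes, packets, stamp_inserted)
SELECT ip_src, ip_dst, SUM(bytes), SUM(packets),
       date_trunc('day', stamp_inserted)
  FROM acct_hourly
 WHERE stamp_inserted >= '2006-05-30 00:00:00'
   AND stamp_inserted <  '2006-05-31 00:00:00'
 GROUP BY ip_src, ip_dst, date_trunc('day', stamp_inserted);
```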

Wim



Re: [pmacct-discussion] PostgreSQL performance

2006-05-01 Thread Paolo Lucente
Hey Sven,

On Fri, Apr 28, 2006 at 02:31:43PM +0200, Sven Anderson wrote:

 it seems to be known that a default MySQL performs a lot better than a
 default PostgreSQL. One difference is that PostgreSQL is doing a COMMIT

Not to open a possible flame, but I fully agree with this view. However, we
massively deal with network objects (IP and MAC addresses, network subnets),
which are handled properly only by PostgreSQL. Do you see chances that this
can re-open the challenge on this scenario?

 Paolo, are you using transactions in the PostgreSQL plugin, to COMMIT all
 UPDATES in one bunch? I couldn't clearly see it in the source code.

Sure! Macro-transactions have been in use for ages (both in the PostgreSQL
and SQLite plugins -- grep COMMIT src/*). The approach of the very first
versions of the plugin was to use micro-transactions, but intuitively it was
performing horrendously compared to the current approach.
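The difference boils down to how many fsync() calls you pay for. An illustrative sketch (made-up rows, not the plugin's actual statements):

```sql
-- Micro-transactions: each statement commits on its own, i.e. one
-- fsync() per row when fsync = true in postgresql.conf.
-- Macro-transaction: one COMMIT (one fsync) for a whole purge cycle:
BEGIN;
UPDATE acct_v5 SET bytes = bytes + 1500, packets = packets + 1
 WHERE ip_src = '10.0.0.1' AND ip_dst = '10.0.0.2'
   AND stamp_inserted = '2006-05-01 17:00:00';
-- ... many more INSERTs/UPDATEs from the same cycle ...
COMMIT;
```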


Cheers,
Paolo




Re: [pmacct-discussion] PostgreSQL performance

2006-04-28 Thread Sven Anderson
Hi all,

Sven Anderson, 21.04.2006 21:34:
 I think the problem is not the updating of the data itself, but updating
 the complex primary key. An index of (ip_src, ip_dst, stamp_inserted) is
 fast enough to find entries, and easy enough to maintain.

it seems to be known that a default MySQL performs a lot better than a
default PostgreSQL. One difference is that PostgreSQL does a COMMIT
after each INSERT/UPDATE, which results in an fsync() call if fsync =
true in postgresql.conf. See this article:
http://monstera.man.poznan.pl/wiki/index.php/Mysql_vs_postgres

Paolo, are you using transactions in the PostgreSQL plugin to COMMIT all
UPDATEs in one batch? I couldn't clearly see it in the source code.

 BTW.: What is the best indexing to make a select with WHERE
 stamp_inserted >= '2006-04-21 20:17:55' AND stamp_inserted < '2006-04-21
 21:17:55' fast? I guess a btree index, but maybe there's something better?

This question is still open. No DBMS gurus here? ;-)


Cheers,

Sven

-- 
Sven Anderson
Institute for Informatics - http://www.ifi.informatik.uni-goettingen.de
Georg-August-Universitaet Goettingen
Lotzestr. 16-18, 37083 Goettingen, Germany



Re: [pmacct-discussion] PostgreSQL performance

2006-04-22 Thread Sven Anderson
Hi Paolo and all,

Paolo Lucente, 10.04.2006 18:42:
 digging through the SQL scripts i've just noticed something bad: primary
 keys of default PostgreSQL tables v2-v5 are just missing the 'vlan' field
 (while it correctly appears in equivalent MySQL/SQLite 3.x schemas).
 This may explain the slowness. To verify this, can you please modify the
 script, place the 'vlan' field inside the primary key (between mac_dst
 and ip_src), create the default table again and see whether everything
 now works correctly ? Surprised that we have not caught this before.
 Let me know.

I created a new table with this index:
acct_pk primary key, btree (agent_id, class_id, mac_src, mac_dst,
vlan, ip_src, ip_dst, port_src, port_dst, ip_proto, tos, stamp_inserted)

and I still have the problem. There is always a long-lasting postmaster
process:
8115 ??  D   0:02.91 postmaster: pmacct pmacct [local] UPDATE (postgres)

which blocks the other DB Writer processes.

My other table with these indexes:
acct_v5_idx btree (ip_src, ip_dst, stamp_inserted)
stamp_inserted_idx hash (stamp_inserted)

still works fine with 14731769 rows.

I think the problem is not the updating of the data itself, but updating
the complex primary key. An index of (ip_src, ip_dst, stamp_inserted) is
fast enough to find entries, and easy enough to maintain.

BTW.: What is the best indexing to make a select with WHERE
stamp_inserted >= '2006-04-21 20:17:55' AND stamp_inserted < '2006-04-21
21:17:55' fast? I guess a btree index, but maybe there's something better?


Cheers,

Sven

-- 
Sven Anderson
Institute for Informatics - http://www.ifi.informatik.uni-goettingen.de
Georg-August-Universitaet Goettingen
Lotzestr. 16-18, 37083 Goettingen, Germany


Re: [pmacct-discussion] PostgreSQL performance

2006-04-19 Thread Jamie Wilkinson
This one time, at band camp, Sven Anderson wrote:
Hi,

I'm just testing pmacct on OpenBSD 3.7. First I used MySQL as the data
backend. Even after several days of capturing into the same table I had no
performance problems. But since the MySQL backend uses a string for
ip_proto and has no IP address type, I decided to switch to PostgreSQL.

But now I have massive performance problems. After one day, when there are
500,000 to 700,000 rows in the table, it gets so slow that it cannot
store the data fast enough any more, resulting in a lot of processes like

22758 ??  I   0:00.00 pmacctd: PostgreSQL Plugin -- DB Writer
[default] (pmacctd)

and

20427 ??  I   0:00.03 postmaster: pmacct pmacct [local] LOCK TABLE
waiting (postgres)

Is there a tuning problem, or is PostgreSQL known to be not as fast as
MySQL? I thought maybe there is some kind of rollback journal being
written, which MySQL doesn't do, or the hash tables are not indexed
correctly? Any ideas?

What version of pmacct are you using?

0.10.0 has patches to remove the LOCK TABLE if you're in insert-only mode,
which I recommend.

I also remove the index from the table to speed up the inserts, and
partition the data into a new table per day so that each table never grows
too big to manage.
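In pmacct's configuration, that combination (insert-only mode plus one table per day) would look roughly like the sketch below. The directive names are from memory of pmacct's CONFIG-KEYS around 0.10.x, so verify them against your version:

```
# never UPDATE existing rows, only INSERT (avoids the LOCK TABLE path)
sql_dont_try_update: true
# dynamic table name, expanded strftime-style: one table per day
sql_table: acct_v5_%Y%m%d
# schema pmacct uses to create the daily table when it does not exist yet
sql_table_schema: /etc/pmacct/acct_v5.schema
```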



Re: [pmacct-discussion] PostgreSQL performance

2006-04-10 Thread Sven Anderson
Hi Aaron and all,

thank you for your reply.

Aaron Glenn, 08.04.2006 01:50:
 It could be a myriad of issues. Have you changed any of the default
 PostgreSQL settings? PostgreSQL, by default, is not configured for
 anything but pure compatibility - meaning it will certainly work, but
 not at its best. For what it's worth, I had PostgreSQL and pmacct on
 an OpenBSD 3.8 box and experienced no problems past the 1.2 million row
 mark (I shut down that installation for unrelated reasons). If you are
 not familiar with PostgreSQL - definitely start reading here [1].

I changed the kernel and PostgreSQL settings as proposed in
README.OpenBSD and INSTALL. But most of that seems to relate to the
number of concurrent connections, which is not the problem. Now I removed
the primary key constraint of acct_v5:

CONSTRAINT acct_uni_pk PRIMARY KEY (agent_id, class_id, mac_src, mac_dst,
ip_src, ip_dst, port_src, port_dst, ip_proto, tos, stamp_inserted)

and replaced it with two indexes:
acct_v5_idx btree (ip_src, ip_dst, stamp_inserted)
stamp_inserted_idx hash (stamp_inserted)
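As DDL, the change described above would look roughly like this (constraint and index names taken from the thread):

```sql
ALTER TABLE acct_v5 DROP CONSTRAINT acct_uni_pk;
CREATE INDEX acct_v5_idx ON acct_v5 USING btree (ip_src, ip_dst, stamp_inserted);
CREATE INDEX stamp_inserted_idx ON acct_v5 USING hash (stamp_inserted);
```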

Now over the weekend I've got 2.6 million rows, and it works OK so far. So
maintaining the complex index of the primary key seems to make a big
difference.

How long did it take to reach the 1.2 million rows in your setup? If the
rows are collected slowly enough, then updating the table is of course no
problem. My machine is quite weak: a 900 MHz PIII CPU, 320 MB RAM and a
normal IDE hard disk. But with MySQL it was no problem, so I'm just comparing.


Cheers,

Sven

-- 
Sven Anderson
Institute for Informatics - http://www.ifi.informatik.uni-goettingen.de
Georg-August-Universitaet Goettingen
Lotzestr. 16-18, 37083 Goettingen, Germany


Re: [pmacct-discussion] PostgreSQL performance

2006-04-10 Thread Paolo Lucente
Hey Sven,
digging through the SQL scripts I've just noticed something bad: the primary
keys of the default PostgreSQL tables v2-v5 are just missing the 'vlan' field
(while it correctly appears in the equivalent MySQL/SQLite 3.x schemas).
This may explain the slowness. To verify this, can you please modify the
script, place the 'vlan' field inside the primary key (between mac_dst
and ip_src), create the default table again and see whether everything
now works correctly? Surprised that we have not caught this before.
Let me know.

Cheers,
Paolo



Re: [pmacct-discussion] PostgreSQL performance

2006-04-08 Thread Aaron Glenn
On 4/7/06, Sven Anderson [EMAIL PROTECTED] wrote:

 Is there a tuning problem, or is PostgreSQL known to be not as fast as
 MySQL? I thought maybe there is some kind of rollback journal being
 written, which MySQL doesn't do, or the hash tables are not indexed
 correctly? Any ideas?

It could be a myriad of issues. Have you changed any of the default
PostgreSQL settings? PostgreSQL, by default, is not configured for
anything but pure compatibility - meaning it will certainly work, but
not at its best. For what it's worth, I had PostgreSQL and pmacct on
an OpenBSD 3.8 box and experienced no problems past the 1.2 million row
mark (I shut down that installation for unrelated reasons). If you are
not familiar with PostgreSQL - definitely start reading here [1].

regards,
aaron.glenn

[1]http://www.postgresql.org/docs/8.1/interactive/runtime-config.html
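For a small box like the 900 MHz / 320 MB machine mentioned earlier in the thread, the first postgresql.conf knobs to look at in the 8.1 era would typically be the following; the values are illustrative starting points, not recommendations:

```
shared_buffers = 10000          # in 8 kB pages, i.e. ~80 MB
work_mem = 8192                 # kB, per sort/hash operation
checkpoint_segments = 8         # fewer, larger checkpoints during bulk inserts
effective_cache_size = 20000    # ~160 MB; hint about the OS page cache size
```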