Re: [pmacct-discussion] MySQL connection issues

2009-10-05 Thread Paolo Lucente
Hi Jeremy,

On Mon, Oct 05, 2009 at 01:14:46PM +1000, Jeremy Lee wrote:

 I'm now discovering that there's often a huge time lag before the data
 goes into the database, ranging from a few minutes to up to an hour. I've
 got debug going, 20 seconds between SQL refreshes, and I get several:

Quoting only the introduction to the issue for the sake of brevity; you
found the reason why you get nothing for minutes, then suddenly it's all
there: buffering. Perhaps try incremental steps, if you have not already
done that, instead of jumping from 1024 to 10240. Find the trade-off
which best suits your scenario.

Overall, what peak Mbps does this installation handle? Any pps figure?
What I'm trying to figure out is why you lose data when using 1K buffers.
Any chance there is a concurrent process consuming full CPU cycles for a
substantial amount of time, which doesn't allow the daemon to cope with
peaks of traffic?

On the persistence of the database connection: I'm open to discussion
and comments on this. I also see it would suit your case just fine. But
let me make a few preliminary remarks:

* pmacct comes from a persistent connection implementation (many years
  ago); this was dropped because it was too fragile when adopted as a
  general-purpose solution, hence the migration to a more stateless
  approach. This was for a mix of reasons, mainly: a) some conditions are
  hard to detect: the server not being shut down properly, or a firewall,
  NAT or load-balancer in the middle timing out the session or restarting,
  etc.; b) communications with the database server always pass through
  3rd party APIs; this easily translates into not having full control over
  things.

* Adding a clean option in this sense might require quite some work to
  make it generally applicable, i.e. we are not speaking about a quick fix
  but something which has to be ported (and tested to work fine) across
  the multiple database software supported by pmacct.

Cheers,
Paolo


___
pmacct-discussion mailing list
http://www.pmacct.net/#mailinglists


Re: [pmacct-discussion] MySQL connection issues

2009-10-05 Thread Jeremy Lee
Thanks for the help, Paolo. I appreciate it.

 Quoting only the introduction to the issue for the sake of brevity; you
 found the reason why you get nothing for minutes, then suddenly it's all
 there: buffering. Perhaps try incremental steps, if you have not already
 done that, instead of jumping from 1024 to 10240. Find the trade-off
 which best suits your scenario.

Yup, that's pretty much what my experiments have shown. I may have been a
little hasty in blaming the MySQL connector for anything other than
jumping between random (but relatively few) IP interfaces. Once I had
enough allow entries in my server access list, the real problem became
clear: the eight-minute lag between when the aggregates are logically
complete (the end of every minute) and when they are put in the database.

 Overall, what peak Mbps does this installation handle? Any pps figure?
 What I'm trying to figure out is why you lose data when using 1K buffers.
 Any chance there is a concurrent process consuming full CPU cycles for a
 substantial amount of time, which doesn't allow the daemon to cope with
 peaks of traffic?

Well, they run something like 30 Mbit/s, or around half a million packets a
minute. But I aggregate down to only a dozen records a minute, by a very
short list of subnets that each machine is responsible for. So, lots of
packets in, but very few rows out. But, I need the result to be as
real-time as possible.

In fact, I would love to decrease the aggregation time to every ten
seconds or so, but one minute is the lowest that the documentation says is
possible.  If I have to generate more records in order to flush the buffer
faster, I would prefer to increase the time resolution.

I've tuned plugin_buffer_size and discovered that the 'default' of 104 (I
get that when I try to set it to zero to disable it. Is that right?) is
too small. Errors occur and counters get corrupted. 1024 seems to work
nicely at the moment, but I remember it was too small when I was
aggregating by IP rather than subnet. I've never had any errors at 10K or
above, and I'd probably prefer to keep it there unless there's a reason to
have it low.

I've always had plugin_pipe_size significantly larger than
plugin_buffer_size. Usually megabytes in size. At least 10:1, more usually
something like 500:1, but I've made it 10,000:1 at times. Not that either
option has much effect on the time lag. They just cause errors if set too
small.

Nothing I've done has significantly changed the eight minute lag which is
even consistent across all the machines. Running 'date' on a console gives
a time eight to ten minutes ahead of the latest database record. If I
restart pmacctd, I get an eight minute gap in the data. If I restart every
five minutes (I got impatient) rows just never get into the database.

Now that I've had some quality time staring at debug logs, it seems pretty
clear that there is an eight minute queue from the aggregator to the mysql
connector. I'd love some way to flush that queue without generating more
records.

 On the persistence of the database connection: I'm open to discussion
 and comments on this. I also see it would suit your case just fine. But
 let me make a few preliminary remarks:

 * pmacct comes from a persistent connection implementation (many years
   ago); this was dropped because it was too fragile when adopted as a
   general-purpose solution, hence the migration to a more stateless
   approach. This was for a mix of reasons, mainly: a) some conditions are
   hard to detect: the server not being shut down properly, or a firewall,
   NAT or load-balancer in the middle timing out the session or restarting,
   etc.; b) communications with the database server always pass through
   3rd party APIs; this easily translates into not having full control over
   things.

MySQL connections especially can be fragile, no argument there. Most
database connections are, because database servers love to reset to known
state as soon as anything goes slightly wrong. But that 'fail fast'
attitude just takes a slightly different approach.

I like your pool of connection managers, but as well as being available to
cope with high loads, I think a couple of them (set by a config option)
should remain connected instead of all shutting down when idle. Keep the
connection alive by regularly executing a query to check the server
connection timeout, say. And if they lose connection, try to reconnect.

PHP uses a persistent connection pool to excellent effect. It works
really, really well, in amazingly hostile environments.

Basically, you treat the last existing connection as a valuable resource.
Not just for the setup and teardown costs, but because if you release a
connection there is no guarantee you can get it back. The MySQL server has
limited connection slots, and pmacctd may be on a machine with limited
TCP/IP sockets when traffic gets heavy.
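
The keep-one-connection-alive idea described above can be sketched roughly
as follows. This is only an illustration in Python, not pmacct code and not
the MySQL client API: the class name, the connect factory and the
execute() method on the connection object are all hypothetical stand-ins
for whatever the real database layer provides.

```python
import time


class PersistentConnection:
    """Keeps one connection open, pinging it and reconnecting on failure."""

    def __init__(self, connect, retries=5, delay=1.0, sleep=time.sleep):
        self._connect = connect    # hypothetical factory returning a connection
        self._retries = retries    # reconnect attempts before giving up
        self._delay = delay        # seconds to wait between attempts
        self._sleep = sleep        # injectable so tests need not really sleep
        self._conn = None

    def _reconnect(self):
        last_error = None
        for _ in range(self._retries):
            try:
                self._conn = self._connect()
                return
            except ConnectionError as exc:
                last_error = exc
                self._sleep(self._delay)
        raise ConnectionError("gave up reconnecting") from last_error

    def execute(self, statement):
        """Run a statement, reconnecting once if the link has gone away."""
        if self._conn is None:
            self._reconnect()
        try:
            return self._conn.execute(statement)
        except ConnectionError:
            self._reconnect()
            return self._conn.execute(statement)

    def keepalive(self):
        """Cheap query so that neither server nor middleboxes time us out."""
        return self.execute("SELECT 1")
```

Calling keepalive() on a timer shorter than the server's idle timeout is
what would keep the slot occupied; the retry loop covers the "retry for a
few seconds rather than just die" case.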

 * Adding a clean option in this sense might require quite some work to
   make it generally applicable, ie. not speaking about a quick fix but
   

Re: [pmacct-discussion] MySQL connection issues

2009-10-04 Thread Jeremy Lee

 Yes, that's right, the MySQL database is on another machine from pmacctd.
 I know this isn't the recommended setup, but (a) the machines are
 co-located

 This is not true. A modular design is generally the way to go for bigger
 installations.

Oh, good. It seemed sensible.

 Both scenarios are indeed supported. RTT of 70ms doesn't look impressive
 but will definitely work.

That's the reliable worst-case. It's usually better, but there are some
machines in other data centres.

 The connection to the database is not persistent; every time the SQL cache
 scanner kicks in (sql_refresh_time), a new connection to the database is
 made.

Ah... In that case, what may be happening is that we are running out of
sockets. At peak loads, there can be so many TCP/IP connections that the
kernel starts to limit them. I've raised this kernel limit in the past,
but there will always be maximal peaks. It is the nature of the internet.
If there was less traffic on those machines, I'd be less interested in
logging it.

That would also explain why I thought I saw the daemon switch to another IP
interface in the middle of running. Because it did! All makes sense now...

I don't suppose there's an option to make at least one connection
persistent? Or for the plugin to retry for a few seconds rather than just
die? I can restart pmacctd, but it loses data in that second or two. And
my client IP interface would also stop jumping around, that would be nice.

I'll bet if I was aggregating less and generating a flood of records, then
I'd always stay connected? So my deliberate attempt to keep transactions
low is biting me in the ass? Typical. ;-)

 The mysql_real_connect() function, part of the MySQL API, doesn't allow
 specifying an IP address to use for originating a TCP connection to a
 remote database. Hence pmacct can do nothing about it; if there is a knob
 to be configured on the MySQL side of things, it is very likely in your
 my.cnf MySQL configuration file. If you are successful in this, let us
 know - it could be good to know for other people.

I'll look into it, and let you know what I find. Down to the source code,
if I have to.
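
For reference, this is what pinning the source address looks like at the
plain socket level: the client binds its local end to a chosen IP before
connecting. The sketch below is Python and says nothing about what the
MySQL client library or pmacct expose; the function name connect_from is
made up for illustration.

```python
import socket


def connect_from(source_ip, host, port, timeout=5.0):
    """Open a TCP connection whose local end is bound to source_ip."""
    # Port 0 lets the kernel pick any free ephemeral port on that address.
    return socket.create_connection((host, port), timeout=timeout,
                                    source_address=(source_ip, 0))
```

A client that skips this bind leaves the choice of source address to the
kernel's routing decision, which on a box with thousands of aliases would
be consistent with the address appearing to jump around.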

 I traditionally totally second Karl's and Wim's position of: PostgreSQL
 rocks; although i'm not sure PostgreSQL would do any better in the same
 pants.

Meh, it's not like I had a choice. LAMP (Linux, Apache, MySQL, PHP) is
basically a de facto standard. It's everywhere, and it mostly works. I
actually picked pmacct because of its MySQL support. It was the crucial
feature.

Once the data was flowing in, it was quite a boring task to throw it into
some web pages and pretty graphs in an existing management system, which
was the whole point of the exercise. I've got a single page which shows
real-time traffic across multiple servers, all nice and neat.

Now I just have to make sure it keeps working when I turn my back for two
hours.

-- 
Jeremy Lee BCompSci (Hons)
 The Unorthodox Engineers
  www.unorthodox.com.au




Re: [pmacct-discussion] MySQL connection issues

2009-10-04 Thread Jeremy Lee

 Couldn't you instead configure pmacctd on the routers to use nfprobe to
 export netflow to the database server and then on the database server run
 nfacctd to collect the netflows from the routers and store to a local
 mysql db.

That sounds needlessly complicated. I'll look into it. :-)

I just want a couple of records in my pre-existing MySQL database from
each server every minute, so I can (a) know the bandwidth RIGHT NOW, and
(b) have a few days of history.

The software has a MySQL connector that is supposed to do the job, that's
why I chose it. The only reason I would not use the built-in connector is
if it doesn't work. In which case there's no reason to use pmacct at all.

Which, frankly, is looking to be the situation.

I'm now discovering that there's often a huge time lag before the data
goes into the database, ranging from a few minutes to up to an hour. I've
got debug going, 20 seconds between SQL refreshes, and I get several:

( default/mysql ) *** Purging cache - START ***
( default/mysql ) *** Purging cache - END (QN: 0, ET: 0) ***
( default/mysql ) *** Purging cache - START ***
( default/mysql ) *** Purging cache - END (QN: 0, ET: 0) ***

In sequence with no records going to the database. And then a bunch of

DEBUG ( default/mysql ): UPDATE `pmacct_james2` SET packets=packets+70,
bytes=bytes+3192, stamp_updated=NOW() WHERE FROM_UNIXTIME(1254708120) =
stamp_inserted AND ip_src='65.19.179.0' AND ip_dst='0.0.0.0'

which I would expect to see every refresh, not once every ten minutes. And
even when the update happens, the data is minutes old.

I think it may have something to do with my buffer sizes, which are
usually set relatively high because otherwise I get:

ERROR ( default/mysql ): We are missing data.
If you see this message once in a while, discard it. Otherwise some
solutions follow:
- increase shared memory size, 'plugin_pipe_size'; now: '10240'.
- increase buffer size, 'plugin_buffer_size'; now: '1024'.
- increase system maximum socket size.

However, it seems that when traffic drops back to lower rates, it takes
forever for these buffers to fill, so minutes go by without a database
update. If I restart pmacctd because I get impatient or think it's
crashed, I lose ten minutes worth of data. Not cool.

I'm just guessing, but it seems like I have a choice of setting them low,
getting regular data, but losing some when traffic is high (which
corrupts the counters), or setting it high and getting the data very late
when traffic is low. Both options suck. I don't understand why it would be
this way.
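
The trade-off above can be put into rough numbers, on the assumption that a
buffer is only handed on to the plugin once it is full. A minimal Python
sketch; the function name and the per-record size are hypothetical, since
the thread does not say how many bytes one record occupies in the buffer.

```python
def seconds_to_fill(buffer_bytes, record_bytes, records_per_second):
    """Estimate how long one buffer takes to fill before it is handed on."""
    if record_bytes <= 0 or records_per_second <= 0:
        raise ValueError("record size and rate must be positive")
    # Only whole records fit; any leftover bytes in the buffer go unused.
    records_per_buffer = buffer_bytes // record_bytes
    return records_per_buffer / records_per_second
```

With an assumed 100-byte record arriving once per second, a 10240-byte
buffer takes about 102 seconds to fill and a 1024-byte one about 10
seconds, so the delay grows with the buffer size and shrinks with the
traffic rate.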

Oh, and I've just got a new kind of error I've never seen before:

WARN ( default/core ): eth0 has become unavailable; throttling ...
ERROR ( default/mysql ): PRIMARY 'mysql' backend trouble.
ERROR ( default/mysql ): The SQL server says: Lost connection to MySQL
server at 'reading authorization packet', system error: 0
WARN ( default/core ): eth0 has become unavailable; throttling ...


Here's config for the router I've been discussing:
===
!
debug: true
!
logfile: /var/log/pmacctd.log
!pidfile: /var/run/pmacct/pmacctd.pid
!
interface: eth0
daemonize: false
!promisc: false
plugin_pipe_size: 1024000
plugin_buffer_size: 10240
!
plugins: mysql
aggregate: src_net,dst_net
!
sql_host: .com
sql_db: 
sql_table: pmacct_
sql_optimize_clauses: true
sql_user: XX
sql_passwd: XXX
sql_multi_values: 64000
!sql_dont_try_update: true
sql_history: 1m
sql_refresh_time: 20
sql_history_roundoff: m
!
networks_file: /config/networks.pmacct
! ports_file: /config/ports.pmacct
!
===

Here's a chunk of the log output:
===
DEBUG ( /config/networks.pmacct ): IPv4 Networks Cache successfully
created: 1 entries.
DEBUG ( default/core ): PCAP buffer: obtained 2048000 / 1024000 bytes.
OK ( default/core ): link type is: 1

( default/mysql ) *** Purging cache - START ***
( default/mysql ) *** Purging cache - END (QN: 0, ET: 0) ***
( default/mysql ) *** Purging cache - START ***
( default/mysql ) *** Purging cache - END (QN: 0, ET: 0) ***
( default/mysql ) *** Purging cache - START ***
( default/mysql ) *** Purging cache - END (QN: 0, ET: 0) ***
( default/mysql ) *** Purging cache - START ***
( default/mysql ) *** Purging cache - END (QN: 0, ET: 0) ***
( default/mysql ) *** Purging cache - START ***
( default/mysql ) *** Purging cache - END (QN: 0, ET: 0) ***
( default/mysql ) *** Purging cache - START ***
( default/mysql ) *** Purging cache - END (QN: 0, ET: 0) ***
( default/mysql ) *** Purging cache - START ***
( default/mysql ) *** Purging cache - END (QN: 0, ET: 0) ***
( default/mysql ) *** Purging cache - START ***
( default/mysql ) *** Purging cache - END (QN: 0, ET: 0) ***
( default/mysql ) *** Purging cache - START ***
( default/mysql ) *** Purging cache - END (QN: 0, ET: 0) ***
( default/mysql ) *** Purging cache - START ***
( 

Re: [pmacct-discussion] MySQL connection issues

2009-10-04 Thread Jeremy Lee

 Couldn't you instead configure pmacctd on the routers to use nfprobe to
 export netflow to the database server and then on the database server run
 nfacctd to collect the netflows from the routers and store to a local
 mysql db.

*sigh* And maybe I would, if I could compile pmacct on the webserver.
Unfortunately, it doesn't have libpcap installed and I don't have the
privileges to install it. (Fancy that: a hosted webserver not letting
us capture all their internal network traffic. Go figure.)

So configure fails, and therefore I can't just make nfacctd.

Plus, in general, installing things on the webserver / db server is about
the last thing you want to have to do in a production environment. That's
why I wanted the MySQL database connection in the first place.

-- 
Jeremy Lee BCompSci (Hons)
 The Unorthodox Engineers
  www.unorthodox.com.au




[pmacct-discussion] MySQL connection issues

2009-10-01 Thread Jeremy Lee

Hi there, everybody!

I've got pmacctd installed and running, and it's working rather nicely for
what I need, except for two small issues:

Some background: I have some 'router' boxes, each with a LOT of IP
addresses. I'm using pmacctd and the mysql plugin on those machines to log
results into a central database that sits on a web server.

Yes, that's right, the MySQL database is on another machine from pmacctd.
I know this isn't the recommended setup, but (a) the machines are
co-located quite close together (70 ms ping times), (b) the database
traffic is very low (20-30 records per minute, I aggregate on subnet), and
(c) I really don't want to install MySQL on every router, for security,
performance, and maintenance reasons.

Generally it works very well. But this morning, after running perfectly
all night, pmacctd started dying with the following message:

INFO: connection lost to 'default-mysql'; closing connection.
INFO: no more plugins active. Shutting down.

This persisted for a while, on all the client machines, and then went
away. But I can just feel it waiting to happen again. I've tried messing
with the 'sql_multi_values' parameter as suggested elsewhere, to no effect.

Is there a way to get the mysql plugin to automatically reconnect to the
server when this happens? I can't seem to find an option anywhere...

Second, I'm noticing that when the mysql plugin connects to the database,
it's not using the 'base' IP address of eth0, but rather one of the other
IP interfaces like eth0:2031. Worse, it seems to bind to a semi-random
IP, and will occasionally switch between runs.

(Yes, these machines actually have hundreds, and even thousands of
addresses. It's their job.)

This creates major problems when you're using security to limit the IP
addresses allowed to connect to the MySQL server, as is standard practice.

I've tried setting the 'nfacctd_ip' parameter to the IP address of eth0,
but it doesn't seem to affect the mysql plugin.

I could potentially solve both problems by (a) writing a 'keepalive'
script to restart the daemon when it dies, and (b) allowing huge ranges of
addresses to connect to the MySQL server, but I don't like kludges.
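
For what it's worth, the 'keepalive' kludge in (a) could look roughly like
this in Python. It is a sketch only: the supervise name and its parameters
are made up, and it assumes the daemon is started in the foreground so that
the call returns when it dies.

```python
import subprocess
import time


def supervise(argv, max_runs=3, pause=1.0, sleep=time.sleep):
    """Run argv, and run it again each time it exits, max_runs times in all."""
    exit_codes = []
    for attempt in range(max_runs):
        if attempt:
            sleep(pause)  # brief pause so a crash loop cannot spin
        exit_codes.append(subprocess.call(argv))
    return exit_codes
```

It does nothing for the data lost while the daemon is down, which is why
an automatic reconnect inside the plugin would be the better fix.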

Any suggestions?


Oh look. One daemon just died again, after running fine for an hour. G.

-- 
Jeremy Lee BCompSci (Hons)
 The Unorthodox Engineers
  www.unorthodox.com.au

