Hi
A small report from the trenches after we've had a visit from someone doing a
directory-attack on one of our servers, and the effect this had.
I have a few MRTG graphs to show, but they are located at an image-host (no
nasties - just a direct link to the images):
* last 2 hours showing the effect after fixing the problem at ~11:55
- http://billedhost.dk/filer/1221728072cpu_load_last_2_hours.png
http://billedhost.dk/filer/1221728072cpu_load_last_2_hours.png
* last 24 hours showing the effect of the problem -
http://www.billedhost.dk/filer/1221728228cpu_load_last_day.png
http://www.billedhost.dk/filer/1221728228cpu_load_last_day.png
* last week showing what the normal load usually is (with a few spikes
here and there) and what the directory-attack did -
http://www.billedhost.dk/filer/1221728247cpu_load_last_week.png
http://www.billedhost.dk/filer/1221728247cpu_load_last_week.png
The history:
At around 17:49 Sunday evening someone began a directory-attack on one of our
servers. The attack was logged like this in Apache's httpd-access.log:
aadi 24.249.18.233 - - [14/Sep/2008:17:49:33 +0200] GET /~aadi/ HTTP/1.1 404
204 - - - - - -
aaliyah 24.249.18.233 - - [14/Sep/2008:17:49:33 +0200] GET /~aaliyah/
HTTP/1.1 404 207 - -- - - -
aaralyn 24.249.18.233 - - [14/Sep/2008:17:49:34 +0200] GET /~aaralyn/
HTTP/1.1 404 207 - -- - - -
aaron 24.249.18.233 - - [14/Sep/2008:17:49:34 +0200] GET /~aaron/ HTTP/1.1
404 205 - -- - - -
abba 24.249.18.233 - - [14/Sep/2008:17:49:34 +0200] GET /~abba/ HTTP/1.1 404
204 - - - - - -
abbie 24.249.18.233 - - [14/Sep/2008:17:49:34 +0200] GET /~abbie/ HTTP/1.1
404 205 - -- - - -
(To help understand the fields, this is the logformat we use: %V %h %l %u %t
\%r\ %s %b \%{Referer}i\ \%{User-agent}i\ \ \%{Via}i\ \%{Pragma}i\
\%{X-Forwarded-For}i\ \ \%{Cache-Control}i\)
In total there was ~14.500 such tries spread equally across 5 vhosts - each one
running their own Resin-instance. On average the ~2900 hits pr. vhost is really
nothing, and would have gone unnoticed if the CPU load hadn't begun to rise
above normal on Monday morning (anything above 1 for an extended period is
abnormal for this server).
So Monday I began looking at what happened on the server. The individual Resin
processes were normal - stacktraces of the running VM's showed nothing, and
stopping/starting them did next to nothing (this is a production-server, so
there was percious little I could try without disrupting services for several
thousand users).
The interesting thing was that, according to the OS (RHEL5.1 x64), it was
Apache who was responsible for the high load - not Resin. Stopping the main
Resin process did lower the load, but as soon as it was started again the load
rose back to the abormal level - and since we hadn't deployed a new version for
a few weeks (and the stacktraces showed nothing of interest) we ruled out our
own code as the culprit.
I began to look elsewhere.. ran a few rootkit-dectors against a know list of
sha1sum's in an attempt to see if (and hopefully not!) there were anything
rotten, but nothing.
Then I began to wonder - even though Apache's /server-status showed a rather
normal load and nothing extraordinary (except 180-190% CPU usage), it must be
possible to see what the individual Apache processes were doing.
I attached to a process with $ strace -p pid and began to look at the system
calls, and I soon began to wonder about a lot of these calls:
open(/tmp/resintmp-DmLgMw, O_RDWR|O_CREAT|O_EXCL, 0600) = 44
write(44, H\0\16check-intervalS\0\0015H\0\6cookieS\0..., 16379) = 16379
...
write(44, last-updateS\0\n1073741823h\0\0c\0\0e\0..., 1818) = 1818
close(44) = 0
rename(/tmp/resintmp-DmLgMw, /tmp/localhost_6856) = 0
unlink(/tmp/resintmp-DmLgMw) = -1 ENOENT (No such file or directory)
stat(/tmp/localhost_6856, {st_mode=S_IFREG|0600, st_size=198379, ...}) = 0
...
open(/tmp/resintmp-K8ylXW, O_RDWR|O_CREAT|O_EXCL, 0600) = 44
write(44, H\0\16check-intervalS\0\0015H\0\6cookieS\0..., 16379) = 16379
...
write(44, last-updateS\0\n1073741823h\0\0c\0\0e\0..., 1818) = 1818
close(44) = 0
rename(/tmp/resintmp-K8ylXW, /tmp/localhost_6856) = 0
unlink(/tmp/resintmp-K8ylXW) = -1 ENOENT (No such file or directory)
Then I remember seeing a post about localhost_srun port files on the
mailinglist from Vlad Artamonov (03 Aug 2008) and Scott Ferguson's reply on the
4th.
Oh dear - I had some large localhost_srun port files:
-rw--- 1 apache apache 198379 Sep 16 11:22 /tmp/localhost_6856
-rw--- 1 apache apache 176417 Sep 16 11:22 /tmp/localhost_6862
-rw--- 1 apache apache 766 Sep 16 11:26 /tmp/localhost_6873
-rw--- 1 apache apache 152038 Sep 16 11:20 /tmp/localhost_6880
-rw--- 1 apache apache 139985 Sep 16