Hi,

I've gotten down to only one collectd-client.rb process (see below). Are the multiple kvm-probes OK?

        Regards,
          Gerry




root@host101:~# ps -ef | grep one
oneadmin 3349 1 0 12:23 ? 00:00:00 ruby /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
oneadmin 21068 3349 0 12:51 ? 00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
oneadmin 21076 21068 0 12:51 ? 00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
oneadmin 21077 21076 0 12:51 ? 00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie




On 21/01/2014 10:10, Javier Fontan wrote:
It seems that more people are having this problem, and we are
looking at several ways to fix it. One problem with /var/run
is that it is normally owned by root, so a process started by the
oneadmin user cannot write there. On the frontend a new directory for
OpenNebula pid files is created, but on the nodes it does not exist.

On Tue, Jan 21, 2014 at 8:07 AM, Gerry O'Brien <[email protected]> wrote:
Hi Javier,

   See my previous email. Another scenario is when
"/tmp/one-collectd-client.pid" does not exist due to issues with /tmp.

    A change seems to have been made to put a pid file in /tmp instead of
/run or /var/run.
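
For what it's worth, the pid file cases that matter here (a missing file, a file with unusable contents, a file pointing at a live process) can be told apart with a short check. This is only a sketch: the function name `pidfile_status` is hypothetical, it assumes the file contains nothing but the PID, and it is not how the OpenNebula probes themselves handle the file.

```python
import os

PID_FILE = "/tmp/one-collectd-client.pid"  # path mentioned in this thread


def pidfile_status(path=PID_FILE):
    """Classify a pid file as missing, unreadable, invalid, stale or running."""
    try:
        with open(path) as f:
            text = f.read().strip()
    except FileNotFoundError:
        return "missing"
    except PermissionError:
        return "unreadable"

    # Assumption: the file holds a single positive integer and nothing else.
    if not text.isdigit() or int(text) <= 0:
        return "invalid"

    try:
        os.kill(int(text), 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return "stale"         # no process with that PID
    except PermissionError:
        return "running"       # process exists but belongs to another user
    except OverflowError:
        return "invalid"       # number too large to be a PID
    return "running"


print(pidfile_status())
```

Anything other than "running" would suggest that a newly started client has no reliable way to detect an earlier instance.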

         Regards,
           Gerry



On 20/01/2014 17:44, Javier Fontan wrote:
I've been trying to reproduce the problem, that is, making OpenNebula
start a large number of collectd-client processes. The only way I was
able to do it is when the file "/tmp/one-collectd-client.pid" exists
and has the wrong permissions. Can you check the ownership and
permissions of that file?
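
A minimal sketch of that check, to be run on the node. The helper name `describe_file` is made up for illustration; only the pid file path comes from this thread.

```python
import os
import pwd
import stat

PID_FILE = "/tmp/one-collectd-client.pid"  # path quoted in this thread


def describe_file(path):
    """Return (owner, mode string) for path, or None if it does not exist."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # UID with no passwd entry
    # stat.filemode renders the mode the way `ls -l` does, e.g. "-rw-r--r--"
    return owner, stat.filemode(st.st_mode)


print(describe_file(PID_FILE))
```

If the owner printed is not oneadmin, or the mode does not give the owner write access, that would match the situation described above.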

On Mon, Jan 20, 2014 at 4:15 PM, Javier Fontan <[email protected]>
wrote:
The problem seems to be the large number of collectd processes running.
Try killing all "collectd-client.rb" processes. There should be only
one running per host.
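
To see how many clients a host is running, the `ps -ef` output can be filtered with a few lines of Python. This is a rough sketch: the function names are hypothetical, and it assumes the client shows up as a ruby command whose first argument ends in collectd-client.rb, as in the listings in this thread.

```python
import os
import subprocess


def collectd_client_pids(ps_output):
    """Return the PIDs of collectd-client.rb processes found in `ps -ef` text."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD...
        cmd = fields[7:]
        if (len(cmd) >= 2
                and os.path.basename(cmd[0]).startswith("ruby")
                and cmd[1].endswith("collectd-client.rb")):
            pids.append(int(fields[1]))
    return pids


def live_collectd_client_pids():
    """Run `ps -ef` on this host and return the matching PIDs."""
    out = subprocess.run(["ps", "-ef"], capture_output=True,
                         text=True, check=True).stdout
    return collectd_client_pids(out)
```

Calling `live_collectd_client_pids()` on a node should return a single PID; more than one means extra clients are still running.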

In case you want to use the old method of monitoring you can follow this
guide:


http://docs.opennebula.org/stable/administration/monitoring/imsshpullg.html#imsshpullg

On Mon, Jan 20, 2014 at 2:17 PM, Gerry O'Brien <[email protected]> wrote:
Hi Ruben,

      Below is the output of 'ps -ef | grep one' on a host that has been
disabled, rebooted and enabled. There are multiple instances of
collectd-client.rb kvm running.


      We have discovered today a serious issue that is having an adverse
effect on our DNS system. When the machine below was enabled, our DNS
server was immediately flooded with requests from the host (see a sample
below).
       Our logs show that this has only started happening since the
upgrade to 4.4. If we don't get a fix for this we will have to go back to
4.2, which is something I really don't want to do.

          Regards,
              Gerry




oneadmin  3628     1  0 13:04 ?        00:00:00 ruby /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
oneadmin  4600     1  0 13:05 ?        00:00:00 ruby /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
oneadmin  6400     1  0 13:07 ?        00:00:00 ruby /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
oneadmin  9003     1  0 13:08 ?        00:00:00 ruby /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
oneadmin 12953  3628  0 13:10 ?        00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
oneadmin 12955  6400  0 13:10 ?        00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
oneadmin 12969 12953  0 13:10 ?        00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
oneadmin 12970 12969  0 13:10 ?        00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
oneadmin 12972 12955  0 13:10 ?        00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
oneadmin 12973 12972  0 13:10 ?        00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
oneadmin 13029 12973  0 13:10 ?        00:00:00 /bin/bash ./monitor_ds.sh kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
oneadmin 13030 12970  0 13:10 ?        00:00:00 /bin/bash ./monitor_ds.sh kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie



-2014 13:14:26.675 client 134.226.59.101#52314: query: host101.scss.tcd.ie IN AAAA + (134.226.32.57)
20-Jan-2014 13:14:26.680 client 134.226.59.101#51356: query: host101.scss.tcd.ie IN A + (134.226.32.57)
20-Jan-2014 13:14:26.680 client 134.226.59.101#51356: query: host101.scss.tcd.ie IN AAAA + (134.226.32.57)
20-Jan-2014 13:14:26.822 client 134.226.59.101#47870: query: host101.scss.tcd.ie IN A + (134.226.32.57)
20-Jan-2014 13:14:26.822 client 134.226.59.101#47870: query: host101.scss.tcd.ie IN AAAA + (134.226.32.57)
20-Jan-2014 13:14:26.824 client 134.226.59.101#58734: query: host101.scss.tcd.ie IN A + (134.226.32.57)
20-Jan-2014 13:14:26.825 client 134.226.59.101#58734: query: host101.scss.tcd.ie IN AAAA + (134.226.32.57)
20-Jan-2014 13:14:26.952 client 134.226.59.101#39659: query: host101.scss.tcd.ie IN A + (134.226.32.57)
20-Jan-2014 13:14:26.952 client 134.226.59.101#39659: query: host101.scss.tcd.ie IN AAAA + (134.226.32.57)
20-Jan-2014 13:14:26.952 client 134.226.59.101#53975: query: host101.scss.tcd.ie IN A + (134.226.32.57)
20-Jan-2014 13:14:26.953 client 134.226.59.101#53975: query: host101.scss.tcd.ie IN AAAA + (134.226.32.57)
20-Jan-2014 13:14:27.108 client 134.226.59.101#36294: query: host101.scss.tcd.ie IN A + (134.226.32.57)
20-Jan-2014 13:14:27.108 client 134.226.59.101#36294: query: host101.scss.tcd.ie IN AAAA + (134.226.32.57)
20-Jan-2014 13:14:27.109 client 134.226.59.101#59277: query: host101.scss.tcd.ie IN A + (134.226.32.57)
20-Jan-2014 13:14:27.109 client 134.226.59.101#59277: query: host101.scss.tcd.ie IN AAAA + (134.226.32.57)
20-Jan-2014 13:14:27.347 client 134.226.59.101#49614: query: host101.scss.tcd.ie IN A + (134.226.32.57)
20-Jan-2014 13:14:27.348 client 134.226.59.101#49614: query: host101.scss.tcd.ie IN AAAA + (134.226.32.57)
20-Jan-2014 13:14:27.350 client 134.226.59.101#44058: query: host101.scss.tcd.ie IN A + (134.226.32.57)
20-Jan-2014 13:14:27.357 client 134.226.59.101#44058: query: host101.scss.tcd.ie IN AAAA + (134.226.32.57)
20-Jan-2014 13:14:27.458 client 134.226.59.101#51830: query: host101.scss.tcd.ie IN A + (134.226.32.57)
20-Jan-2014 13:14:27.458 client 134.226.59.101#51830: query: host101.scss.tcd.ie IN AAAA + (134.226.32.57)
20-Jan-2014 13:14:27.461 client 134.226.59.101#38419: query: host101.scss.tcd.ie IN A + (134.226.32.57)
20-Jan-2014 13:14:27.461 client 134.226.59.101#38419: query: host101.scss.tcd.ie IN AAAA + (134.226.32.57)
20-Jan-2014 13:14:31.184 client 134.226.59.101#38617: query: host101.scss.tcd.ie IN A + (134.226.32.57)
20-Jan-2014 13:14:31.184 client 134.226.59.101#38617: query: host101.scss.tcd.ie IN AAAA + (134.226.32.57)
20-Jan-2014 13:14:31.302 client 134.226







On 17/01/2014 17:45, Ruben S. Montero wrote:
Hi Gerry

Just to check, are you using 4.4 Final? We've seen this in the betas and
"thought" we fixed it for the final version. Also, could you check that
there is just one monitoring process on the hosts (collectd-client.sh, or
equiv, should be the name of the process)?

Also, could you send us the lines from oned.log between Thu Jan 16 16:56:25
2014 and Thu Jan 16 17:25:43 2014, plus the first lines that include your
oned.conf values (we are especially interested in those related to the
monitoring interval)?


Cheers

Ruben




On Fri, Jan 17, 2014 at 2:27 PM, Gerry O'Brien <[email protected]>
wrote:

Hi,

       Below is a truncated log file for a VM. The monitor continually
cycles through finding the machine RUNNING and state UNKNOWN. This occurs
for many, many machines at the same time. All machines were created by a
script.

       The VMs are Microsoft Windows 7 64bit Enterprise. Individual
context is created by a startup script. They run fine but eventually
/var/log/one is going to overflow.

       Restarting oned seems to fix the problem but this is hardly a
long-term solution.

       Any suggestions on what could be causing this?

           Regards,
               Gerry




Thu Jan 16 16:56:21 2014 [DiM][I]: New VM state is ACTIVE.
Thu Jan 16 16:56:22 2014 [LCM][I]: New VM state is PROLOG.
Thu Jan 16 16:56:22 2014 [VM][I]: Virtual Machine has no context
Thu Jan 16 16:56:22 2014 [LCM][I]: New VM state is BOOT
Thu Jan 16 16:56:22 2014 [VMM][I]: Generating deployment file: /var/lib/one/vms/1788/deployment.0
Thu Jan 16 16:56:23 2014 [VMM][I]: ExitCode: 0
Thu Jan 16 16:56:23 2014 [VMM][I]: Successfully execute network driver operation: pre.
Thu Jan 16 16:56:25 2014 [VMM][I]: ExitCode: 0
Thu Jan 16 16:56:25 2014 [VMM][I]: Successfully execute virtualization driver operation: deploy.
Thu Jan 16 16:56:25 2014 [VMM][I]: ExitCode: 0
Thu Jan 16 16:56:25 2014 [VMM][I]: Successfully execute network driver operation: post.
Thu Jan 16 16:56:25 2014 [LCM][I]: New VM state is RUNNING
Thu Jan 16 16:56:51 2014 [LCM][I]: New VM state is UNKNOWN
Thu Jan 16 16:59:01 2014 [VMM][I]: VM found again, state is RUNNING
Thu Jan 16 16:59:23 2014 [LCM][I]: New VM state is UNKNOWN
Thu Jan 16 17:01:41 2014 [VMM][I]: VM found again, state is RUNNING
Thu Jan 16 17:01:58 2014 [LCM][I]: New VM state is UNKNOWN
Thu Jan 16 17:04:18 2014 [VMM][I]: VM found again, state is RUNNING
Thu Jan 16 17:04:39 2014 [LCM][I]: New VM state is UNKNOWN
Thu Jan 16 17:06:55 2014 [VMM][I]: VM found again, state is RUNNING
Thu Jan 16 17:07:06 2014 [LCM][I]: New VM state is UNKNOWN
Thu Jan 16 17:09:31 2014 [VMM][I]: VM found again, state is RUNNING
Thu Jan 16 17:09:31 2014 [LCM][I]: New VM state is UNKNOWN
Thu Jan 16 17:12:22 2014 [VMM][I]: VM found again, state is RUNNING
Thu Jan 16 17:12:27 2014 [LCM][I]: New VM state is UNKNOWN
Thu Jan 16 17:15:11 2014 [VMM][I]: VM found again, state is RUNNING
Thu Jan 16 17:15:22 2014 [LCM][I]: New VM state is UNKNOWN
Thu Jan 16 17:17:49 2014 [VMM][I]: VM found again, state is RUNNING
Thu Jan 16 17:18:00 2014 [LCM][I]: New VM state is UNKNOWN
Thu Jan 16 17:20:27 2014 [VMM][I]: VM found again, state is RUNNING
Thu Jan 16 17:20:34 2014 [LCM][I]: New VM state is UNKNOWN
Thu Jan 16 17:23:04 2014 [VMM][I]: VM found again, state is RUNNING
Thu Jan 16 17:23:08 2014 [LCM][I]: New VM state is UNKNOWN
Thu Jan 16 17:25:41 2014 [VMM][I]: VM found again, state is RUNNING
Thu Jan 16 17:25:43 2014 [LCM][I]: New VM state is UNKNOWN
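
To put a number on the cycling in a log like the one above, the two repeating messages can be counted and timed. A small sketch, assuming the timestamp format shown in these lines and an English (C) locale for the day and month names; `count_flaps` is a made-up name.

```python
from datetime import datetime

UNKNOWN_MSG = "New VM state is UNKNOWN"
FOUND_MSG = "VM found again, state is RUNNING"


def count_flaps(log_text):
    """Count UNKNOWN / found-again messages and the seconds they span."""
    unknown = 0
    found = 0
    stamps = []
    for line in log_text.splitlines():
        if UNKNOWN_MSG in line:
            unknown += 1
        elif FOUND_MSG in line:
            found += 1
        else:
            continue
        # The timestamp is everything before the first " [", e.g.
        # "Thu Jan 16 16:56:51 2014"
        text = line.split(" [", 1)[0]
        stamps.append(datetime.strptime(text, "%a %b %d %H:%M:%S %Y"))
    span = (max(stamps) - min(stamps)).total_seconds() if stamps else 0.0
    return {"unknown": unknown, "found_again": found, "span_seconds": span}
```

Feeding it the contents of a VM log file gives the number of transitions and the period over which they occurred, which makes it easier to compare hosts or to check whether a change reduced the flapping.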

--
Gerry O'Brien

Systems Manager
School of Computer Science and Statistics
Trinity College Dublin
Dublin 2
IRELAND

00 353 1 896 1341

_______________________________________________
Users mailing list
[email protected]
http://lists.opennebula.org/listinfo.cgi/users-opennebula.org

--
Javier Fontán Muiños
Developer
OpenNebula - The Open Source Toolkit for Data Center Virtualization
www.OpenNebula.org | @OpenNebula | github.com/jfontan


