[PATCHv2] usb/net/asix_devices: Add USBNET HG20F9 ethernet dongle

2013-02-27 Thread Glen Turner
This USB ethernet adapter was purchased in anodyne packaging
from the computer store adjacent to linux.conf.au 2013 in
Canberra (Australia). A web search shows other recent
purchasers in Lancaster (UK) and Seattle (USA). Just like an
emergent virus, our age of e-commerce and airmail allows
underdocumented hardware to spread around the world instantly
using the vector of ridiculously low prices.

Paige Thompson, infected via eBay, discovered that the HG20F9
is a copy of the Asix 88772B; many viruses copy the RNA of
other viruses. See Paige's work at
<https://github.com/paigeadele/HG20F9>.
This patch uses her discovery to update the restructured Asix
driver in the current kernel.

Just as some viruses inhabit seemingly-healthy cells, the
HG20F9 uses the Vendor ID 0x066b assigned to Linksys Inc.
For the present there is no clash of Product ID 0x20f9.

Signed-off-by: Glen Turner <g...@gdt.id.au>
---
David,

My apologies for the patch failing to compile. I worked off Linus' GIT tree
from a week ago, which I now realise was ignorant. Today I downloaded
<git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git>,
modified the patch to suit, compiled with a distributor's .config (ten hours
on my EeePC 901), and re-tested against a range of switches and traffic.

Thanks to Bjørn Mork for the heads-up about comment coding style. I had
copied the style used throughout asix_devices.c but one never knows where
being on the bad side of an automated tool can lead, now that the creator
of the Daleks can no longer protect us from robots with rigid notions
of the acceptable and the exterminatable.

-glen

 drivers/net/usb/asix_devices.c |   31 +++
 1 file changed, 31 insertions(+)

diff --git a/drivers/net/usb/asix_devices.c b/drivers/net/usb/asix_devices.c
index 2205dbc..7097534 100644
--- a/drivers/net/usb/asix_devices.c
+++ b/drivers/net/usb/asix_devices.c
@@ -924,6 +924,29 @@ static const struct driver_info ax88178_info = {
    .tx_fixup = asix_tx_fixup,
 };
 
+/*
+ * USBLINK 20F9 "USB 2.0 LAN" USB ethernet adapter, typically found in
+ * no-name packaging.
+ * USB device strings are:
+ *   1: Manufacturer: USBLINK
+ *   2: Product: HG20F9 USB2.0
+ *   3: Serial: 03
+ * Appears to be compatible with Asix 88772B.
+ */
+static const struct driver_info hg20f9_info = {
+   .description = "HG20F9 USB 2.0 Ethernet",
+   .bind = ax88772_bind,
+   .unbind = ax88772_unbind,
+   .status = asix_status,
+   .link_reset = ax88772_link_reset,
+   .reset = ax88772_reset,
+   .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_LINK_INTR |
+            FLAG_MULTI_PACKET,
+   .rx_fixup = asix_rx_fixup_common,
+   .tx_fixup = asix_tx_fixup,
+   .data = FLAG_EEPROM_MAC,
+};
+
 extern const struct driver_info ax88172a_info;
 
 static const struct usb_device_id  products [] = {
@@ -1063,6 +1086,14 @@ static const struct usb_device_id  products [] = {
    /* ASIX 88172a demo board */
    USB_DEVICE(0x0b95, 0x172a),
    .driver_info = (unsigned long) &ax88172a_info,
+}, {
+   /*
+    * USBLINK HG20F9 "USB 2.0 LAN"
+    * Appears to have gazumped Linksys's manufacturer ID but
+    * doesn't (yet) conflict with any known Linksys product.
+    */
+   USB_DEVICE(0x066b, 0x20f9),
+   .driver_info = (unsigned long) &hg20f9_info,
 },
    { },        // END
 };
-- 
1.7.10.4
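
The second hunk above works because the driver is selected purely by the (Vendor ID, Product ID) pair. As a rough user-space illustration of that lookup, and not the kernel's actual matching code, the sketch below walks a zero-terminated table. Every name prefixed `sketch_` is hypothetical and the description strings are placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's driver_info and usb_device_id;
 * only the fields needed for the illustration are kept. */
struct sketch_driver_info {
    const char *description;
};

struct sketch_device_id {
    uint16_t vendor;
    uint16_t product;
    const struct sketch_driver_info *info;
};

static const struct sketch_driver_info sketch_ax88172a = { "ASIX 88172a demo board" };
static const struct sketch_driver_info sketch_hg20f9 = { "HG20F9 USB 2.0 Ethernet" };

/* Terminated by an empty entry, like the products[] table in the patch. */
static const struct sketch_device_id sketch_products[] = {
    { 0x0b95, 0x172a, &sketch_ax88172a },
    { 0x066b, 0x20f9, &sketch_hg20f9 },   /* Linksys vendor ID, unused product ID */
    { 0, 0, NULL },
};

/* Return the entry's info when both IDs match exactly, else NULL. */
static const struct sketch_driver_info *
sketch_match(const struct sketch_device_id *table, uint16_t vendor, uint16_t product)
{
    assert(table != NULL);
    for (; table->info != NULL; table++) {
        if (table->vendor == vendor && table->product == product)
            return table->info;
    }
    return NULL;
}
```

With this table, `sketch_match(sketch_products, 0x066b, 0x20f9)` selects the HG20F9 entry, while any other product ID under 0x066b falls through to NULL. This is why the borrowed Vendor ID is harmless only as long as no genuine Linksys product uses 0x20f9.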



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] usb/net/asix_devices: Add USBNET HG20F9 ethernet dongle

2013-02-25 Thread Glen Turner
This USB ethernet adapter was purchased in anodyne packaging
marked "USB2.0 to LAN" from the computer store adjacent to
linux.conf.au 2013 in Canberra (Australia). A web search
shows other recent purchasers in Lancaster (UK) and Seattle
(USA). Just like an emergent virus, our age of e-commerce and
airmail allows underdocumented hardware to spread around the
world instantly using the vector of ridiculously low prices.

Paige Thompson, infected via eBay, discovered that the HG20F9
is a copy of the Asix 88772B; many viruses copy the RNA of
other viruses. See Paige's work at
<https://github.com/paigeadele/HG20F9>.
This patch uses her discovery to update the restructured Asix
driver in the current kernel.

The spread of viruses is often accompanied by rumours. It is
rumoured that the HG20F9 has extensions to provide gigabit
ethernet. This patch does not chase that chimera.

Just as some viruses inhabit seemingly-healthy cells, the
HG20F9 uses the Vendor ID 0x066b assigned to Linksys Inc.
For the present there is no clash of Product ID 0x20f9.

Signed-off-by: Glen Turner <g...@gdt.id.au>
---
 drivers/net/usb/asix_devices.c |   24 
 1 file changed, 24 insertions(+)

diff --git a/drivers/net/usb/asix_devices.c b/drivers/net/usb/asix_devices.c
index 7a6e758..649025d 100644
--- a/drivers/net/usb/asix_devices.c
+++ b/drivers/net/usb/asix_devices.c
@@ -883,6 +883,24 @@ static const struct driver_info ax88178_info = {
    .tx_fixup = asix_tx_fixup,
 };
 
+// USBLINK 20F9 "USB 2.0 LAN" USB ethernet adapter, typically found in
+// no-name packaging.
+// USB device strings are:
+//   1: Manufacturer: USBLINK
+//   2: Product: HG20F9 USB2.0
+//   3: Serial: 03
+// Appears to be compatible with Asix 88772B.
+static const struct driver_info hg20f9_info = {
+   .description = "HG20F9 USB 2.0 Ethernet",
+   .bind = ax88772_bind,
+   .status = asix_status,
+   .link_reset = ax88772_link_reset,
+   .reset = ax88772_reset,
+   .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_LINK_INTR | FLAG_MULTI_PACKET,
+   .rx_fixup = asix_rx_fixup,
+   .tx_fixup = asix_tx_fixup,
+};
+
 extern const struct driver_info ax88172a_info;
 
 static const struct usb_device_id  products [] = {
@@ -1022,6 +1040,12 @@ static const struct usb_device_id  products [] = {
    /* ASIX 88172a demo board */
    USB_DEVICE(0x0b95, 0x172a),
    .driver_info = (unsigned long) &ax88172a_info,
+}, {
+   // USBLINK HG20F9 "USB 2.0 LAN"
+   // Appears to have gazumped Linksys's manufacturer ID but
+   // doesn't (yet) conflict with any known Linksys product.
+   USB_DEVICE(0x066b, 0x20f9),
+   .driver_info = (unsigned long) &hg20f9_info,
 },
    { },        // END
 };
-- 
1.7.10.4



Re: After many hours all outbound connections get stuck in SYN_SENT

2007-12-20 Thread Glen Turner

> I do have TCP Sequence # Randomization enabled on my router.

Huh?  Do you mean a PIX blade in a Cisco switch-router chassis? It
would be very useful if you could be less vague about the
equipment in use.

>  However,
> if this was causing an issue, wouldn't it always occur and cause
> connection issues, not just after 38 hours of correct operation?

That depends more on your customers' networking attributes
than you are sharing or perhaps even know.  Perhaps your customer
base is very Window-skewed and you simply aren't seeing any Sack
Permitted negotiations for the first 37.999 hours. Or
perhaps you've had a network glitch, and all of your
connections have done a Selective Ack, which the firewall
has trashed, leaving all the connections in a wacko state,
not just a few which you haven't noticed.

The actual failure mode needs a packet trace to determine,
but you should be able to do this yourself (or ask your
local network engineering staff).

If your firewall is trashing the Sack field, then it needs
to be fixed.  Time to raise a case with the Cisco TAC and
ask them directly if your PIX version has bug CSCse14419.
You can't expect Sack to work when it's being fed trash,
so it is important to make sure that is not happening.

Cheers, Glen
#include <network_engineer.h>
#undef KERNEL_HACKER



Re: After many hours all outbound connections get stuck in SYN_SENT

2007-12-20 Thread Glen Turner
[speculation by network engineer -- not kernel hacker -- follows]

> The router could be sooo crappy that it drops all packets from
> TCP streams that have SACK enabled and the client has opened
> 200+ SACK connections previously... something like that?

As far as any third party is concerned the existing TCP connections
continue to have negotiated "SACK Permitted". Only new connections
will not negotiate this.  So "router crappiness" promptly disappearing
doesn't seem too likely (a way I could see this happening is if the
Linux box sends an Ack for each connection and this clears out Sack
datastructures on the third party).

But I'd be very surprised if the router is acting as anything more
than a network-layer device. It might perhaps have some soft connection
state being used for generating accounting records.  Being Cisco
it's probably a switch-router, so it might carry some per-port hard
state for validating source IP addresses and ARPs on each port.

The firewall is much more likely to be carrying per-flow Sack
state. The Cisco PIX had a bug with SACK handling (CSCse14419,
fixed in 7.0(7), 7.1(2.34), 7.2(2.2), 8.0(0.141) but perhaps it
has regressed). A simple trace either side of the firewall will
show the inconsistency between the TCP sequence number (which
gets randomised) and the Sack sequence number (which didn't).
You could disable the TCP Sequence Number Randomisation feature
and see if the fault reoccurs.
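
To make the suspected inconsistency concrete, here is a minimal sketch, assuming a firewall that shifts every outbound sequence number by a fixed per-connection offset and may or may not shift the returning SACK option back. It is a toy model of the symptom described above, not the PIX implementation; the function names and the offset handling are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sack_block {
    uint32_t left;   /* first sequence number of the received block */
    uint32_t right;  /* sequence number just past the block */
};

/* Wraparound-safe "a comes before b" in 32-bit sequence space. */
static bool seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Sender-side sanity check: snd_una <= left < right <= snd_nxt. */
static bool sack_plausible(uint32_t snd_una, uint32_t snd_nxt, struct sack_block b)
{
    return !seq_before(b.left, snd_una) &&
           seq_before(b.left, b.right) &&
           !seq_before(snd_nxt, b.right);
}

/* Hypothetical firewall: host_view is the block in the sender's own
 * numbering; rewrite_sack says whether the SACK option is translated
 * back on the return path. */
static bool sack_survives_firewall(uint32_t snd_una, uint32_t snd_nxt,
                                   struct sack_block host_view,
                                   uint32_t offset, bool rewrite_sack)
{
    assert(sack_plausible(snd_una, snd_nxt, host_view));

    /* Outbound: the remote end sees every sequence number shifted. */
    struct sack_block on_wire = { host_view.left + offset,
                                  host_view.right + offset };

    /* Inbound: a correct firewall shifts the SACK option back. */
    struct sack_block delivered = on_wire;
    if (rewrite_sack) {
        delivered.left -= offset;
        delivered.right -= offset;
    }
    return sack_plausible(snd_una, snd_nxt, delivered);
}
```

With a window of 1000 to 5000 and a block of 3000 to 4000, any large offset leaves the untranslated block far outside the window, which is the mismatch a trace on both sides of the firewall should show.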

You should probably also investigate the Linux kernel,
especially the size and locks of the components of the Sack data
structures and what happens to those data structures after Sack is
disabled (presumably the Sack data structure is in some unhappy
circumstance, and disabling Sack allows the data to be discarded,
magically unclogging the box).

In the absence of the reporter wanting to dump the kernel's
core, how about a patch to print the Sack datastructure when
the command to disable Sack is received by the kernel?
Maybe just print the last 16b of the IP address?

Best wishes, Glen



Re: Detecting process death for anycast named process monitoring

2007-05-03 Thread Glen Turner
On Thu, 2007-05-03 at 02:40 -0700, Andrew Morton wrote:
> Monitor the system using the taskstats interface.  There is a sample
> application and documentation in Documentation/accounting/.
> 
> Your monitoring application will receive a netlink packet each time a process
> exits.  It includes the exit code and the process's name.

Marvellous, just what is needed.  Thank you Andrew.



Re: Detecting process death for anycast named process monitoring

2007-05-03 Thread Glen Turner

Hi Russell,

Thanks for your answer.

> If you did have a process which polls for the service, what happens if
> that process dies?

The failure mode is good. The monitoring process dies, the
interface stays up, ospfd keeps advertising the route, named
keeps running. We pick up the lack of a monitoring process
in Nagios, manually down lo:1 so the traffic goes elsewhere
and investigate the fault. No customer impact unless
named dies for some independent reason before the NOC staff
down lo:1.

> ... Given that
> you're always going to have another process (which might be killed)
> your thought about having a parent process monitor the death of the
> child seems to be the simplest.

The failure mode for a parent monitoring process is not good.
The monitoring process dies, the interface stays up, ospfd
keeps advertising the route, the child named dies.  Since
we still have incoming DNS requests but no running DNS server,
customers will need to timeout and try the next DNS server
in their /etc/resolv.conf.  So customer impact is severely
reduced web performance until the NOC staff log
in and down lo:1.

As you can see, the basic requirement is for the lo:1 interface
to track the state of the named process at all times.

> What if the dbus system dies?  What if your monitoring process dies?

As long as these don't kill named whilst failing, we have enough
time to sort it out manually. Nagios (or whatever system health
monitor you choose to configure) will hassle the Network Operations
Center in short order.

> Surely a simple solution is going to be the best solution?

That's why I'm posting here. I'd settle for some simple answer,
even if it is particular to Linux.

> You could also have that process interact with a watchdog, so failures
> with that process cause a reboot.

No need. Dropping lo:1 makes the DNS traffic go to a healthier
server. Then the box can be left as-is so the sysadmins can take
it apart to find the basic cause of the fault.

Thanks for your thoughts. Some monitoring mechanism that didn't
kill named if it goes wrong would be fantastic.

Glen



Detecting process death for anycast named process monitoring

2007-05-02 Thread Glen Turner

Hi folks,

Anycast services are a nice way of robustly offering DNS and other
services.  We create an interface which reflects the availability
of the service and advertise that into the network using an OSPF
router like Quagga.

For more detail see
http://www.aarnet.edu.au/~gdt/presentations/2006-07-18-linuxsa-anycast/
which is a summary of work which was presented at linux.conf.au.

The question is, how can a process with no relationship to another
process detect that process unexpectedly dying?  If named goes
away to a better place, we want to shut down the interface
which causes Quagga to inject the anycast route.

We don't want to be the parent of the running process, because that
doesn't add robustness. If the parent process dies, then the service
dies, and the interface still stays up.

We don't want to poll, because that isn't pretty and the polling
interval needs to be very short on a big ISP's DNS servers.
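
For reference, the polling approach being ruled out here usually amounts to probing the PID with signal 0. The following is a minimal sketch of that baseline, assuming the monitor already knows named's PID; the function names are illustrative only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <time.h>

/* True if a process with this PID currently exists. Signal 0 runs the
 * existence and permission checks without delivering anything. EPERM
 * means the process exists but is owned by someone else. An unreaped
 * zombie still counts as existing. */
static bool process_exists(pid_t pid)
{
    assert(pid > 0);
    if (kill(pid, 0) == 0)
        return true;
    return errno == EPERM;
}

/* Probe every interval_ms milliseconds until the process is gone.
 * Returns the number of probes made, which is the cost of polling. */
static unsigned long wait_until_gone(pid_t pid, long interval_ms)
{
    struct timespec delay = { interval_ms / 1000,
                              (interval_ms % 1000) * 1000000L };
    unsigned long probes = 1;

    while (process_exists(pid)) {
        nanosleep(&delay, NULL);
        probes++;
    }
    return probes;
}
```

Detection latency is bounded by the interval, so a busy DNS server forces a very short interval and a constant stream of probes, which is the ugliness described above. PID reuse between probes is a further weakness of this scheme.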

I have tried using the various notify functions against /proc, but
they don't work for that filesystem. I have tried using notify
against a UNIX domain socket, but notify doesn't work for
that either.

Suggestions, or a patch to support notify for /proc or to push
process death notifications into DBUS or whatever, are welcome.

Thank you, Glen

